Big Text: PowerPoint Presentation

Uploaded On 2017-11-13

Big Text: from Names and Phrases to Entities and Relations. Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany. http://www.mpi-inf.mpg.de/~weikum/ (ID: 605273)



Presentation Transcript

Slide1

Big Text: from Names and Phrases to Entities and Relations

Gerhard Weikum

Max Planck Institute for Informatics

Saarbrücken, Germany

http://www.mpi-inf.mpg.de/~weikum/

Slide2

From Natural-Language Text to Knowledge

Web Contents

Knowledge

knowledge acquisition

intelligent interpretation

more knowledge, analytics, insight

Slide3

Cyc

TextRunner/ReVerb

WikiTaxonomy/WikiNet

SUMO

ConceptNet 5

BabelNet

ReadTheWeb

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

Web of Data & Knowledge (Linked Open Data)

> 50 billion subject-predicate-object triples from > 1000 sources

Slide4

10M entities in 350K classes, 120M facts for 100 relations, 100 languages, 95% accuracy

4M entities in 250 classes, 500M facts for 6000 properties, live updates

40M entities in 15000 topics, 1B facts for 4000 properties, core of Google Knowledge Graph

600M entities in 15000 topics, 20B facts

Web of Data & Knowledge

> 50 billion subject-predicate-object triples from > 1000 sources

Slide5

Web of Data & Knowledge

> 50 billion subject-predicate-object triples from > 1000 sources

taxonomic knowledge

factual knowledge

temporal knowledge

evidence & belief knowledge

terminological knowledge

Yimou_Zhang type movie_director

Yimou_Zhang type olympic_games_participant

movie_director subclassOf artist

Yimou_Zhang directed Shanghai Triad

Li Gong actedIn Shanghai Triad

Yimou_Zhang memberOf Beijing_film_academy validDuring [1978, 1982]

Yimou_Zhang “was classmate of” Kaige_Chen

Yimou_Zhang “had love affair with” Li_Gong

Li_Gong knownAs “China’s most beautiful”
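The statements on this slide all share the subject-predicate-object shape. The following is a minimal Python sketch of how such facts could be stored and queried. It is an illustration under stated assumptions, not the storage format of any particular knowledge base: the names `Triple`, `valid_during`, `objects_of`, and `types_with_superclasses` are hypothetical, and "Shanghai Triad" is written with an underscore only to keep identifiers uniform.

```python
from typing import List, NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    """One subject-predicate-object statement (hypothetical record type)."""
    subject: str
    predicate: str
    obj: str
    valid_during: Optional[Tuple[int, int]] = None  # temporal scope, if any


# Facts copied from the slide; the in-memory list is an assumption.
KB: List[Triple] = [
    Triple("Yimou_Zhang", "type", "movie_director"),
    Triple("Yimou_Zhang", "type", "olympic_games_participant"),
    Triple("movie_director", "subclassOf", "artist"),
    Triple("Yimou_Zhang", "directed", "Shanghai_Triad"),
    Triple("Li_Gong", "actedIn", "Shanghai_Triad"),
    Triple("Yimou_Zhang", "memberOf", "Beijing_film_academy", (1978, 1982)),
]


def objects_of(kb: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return all objects o such that (subject, predicate, o) is in the KB."""
    return [t.obj for t in kb if t.subject == subject and t.predicate == predicate]


def types_with_superclasses(kb: List[Triple], entity: str) -> List[str]:
    """Direct types of an entity plus every class reachable via subclassOf."""
    result: List[str] = []
    frontier = objects_of(kb, entity, "type")
    while frontier:
        cls = frontier.pop()
        if cls not in result:
            result.append(cls)
            frontier.extend(objects_of(kb, cls, "subclassOf"))
    return sorted(result)
```

Calling `types_with_superclasses(KB, "Yimou_Zhang")` combines the taxonomic facts (type and subclassOf), so the director is also reported as an artist.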

Slide6

Knowledge for Intelligent Applications

Enabling technology for:

disambiguation in written & spoken natural language

deep reasoning (e.g. QA to win quiz game)

machine reading (e.g. to summarize book or corpus)

semantic search in terms of entities & relations (not keywords & pages)

entity-level linkage for Big Data & Big Text analytics

Slide7

Big Text Analytics: Who Covered Whom?

1000s of databases, 100s of millions of Web tables, 100s of billions of Web & social media pages

Musician          Original          Title
Elvis Presley     Frank Sinatra     My Way
Robbie Williams   Frank Sinatra     My Way
Sex Pistols       Frank Sinatra     My Way
Frank Sinatra     Claude Francois   Comme d’Habitude
...               ...               ...

in different language, country, key, …

with more sales, awards, media buzz, …

Slide8

Big Text Analytics: Who Covered Whom?

1000s of databases, 100s of millions of Web tables, 100s of billions of Web & social media pages

Musician         Original      Title
Claudia Leitte   Bruno Mars    Billionaire
Only Won         Bruno Mars    I wanna be an engineer
Yip Sin-Man      Celine Dion   我心永恒
Norah Jones      Bob Dylan     Forever Young
Beyonce          Alphaville    Forever Young
Jay-Z            Alphaville    Forever Young
...              ...           ...

in different language, country, key, …

with more sales, awards, media buzz, …

Slide9

Big Text Analytics: Who Covered Whom?

1000s of databases, 100s of millions of Web tables, 100s of billions of Web & social media pages

in different language, country, key, …

with more sales, awards, media buzz, …

Musician        PerformedTitle
Sex Pistols     My Way
Frank Sinatra   My Way
Only Won        I wanna be an engineer
Beyonce         Forever Young

Musician          CreatedTitle
Francis Sinatra   My Way
Paul Anka         My Way
Bruno Mars        Billionaire
Bob Dylan         Forever Young

Name          Group
Sid Vicious   Sex Pistols
Bono          U2

Name         Event
Beyonce      Grammy
Norah J.     Jobs Mem.
Claudia L.   FIFA 2014

Slide10

Big Text Analytics: Who Covered Whom?

1000s of databases, 100s of millions of Web tables, 100s of billions of Web & social media pages

in different language, country, key, …

with more sales, awards, media buzz, …

Musician         PerformedTitle
Sex Pistols      My Way
Frank Sinatra    My Way
Claudia Leitte   Famo$a
Only Won         I wanna be an engineer
Beyonce          Forever Young

Musician          CreatedTitle
Francis Sinatra   My Way
Paul Anka         My Way
Bruno Mars        Billionaire
Travie McCoy      Billionaire
Bob Dylan         Forever Young

Big Data: Volume, Velocity, Variety, Veracity

Big Data & Big Text: Volume, Velocity, Variety, Veracity

Slide11

Big Data & Big Text Analytics

Entertainment: Who covered which other singer? Who influenced which other musicians?

Health: Drugs (combinations) and their side effects

Politics: Politicians’ positions on controversial topics and their involvement with industry

Business: Customer opinions on small-company products, gathered from social media

Culturomics: Trends in society, cultural factors, etc.

General Design Pattern:

Identify relevant contents sources

Identify entities of interest & their relationships

Position in time & space

Group and aggregate

Find insightful patterns & predict trends

Slide12

Outline

Introduction

Lovely NERD

The New Chocolate

The Dark Side

Conclusion

Slide13

Lovely NERD

Slide14

Named Entity Recognition & Disambiguation (NERD)

Zhang played in Lee’s epic masterpiece, with Ma’s score. Ma also covered the Ecstasy.

prior popularity of name-entity pairs

contextual similarity: mention vs. entity (bag-of-words, language model)
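The two signals on this slide, popularity prior and contextual similarity, can be sketched in a few lines of Python. This is only an illustration under assumptions: the counts in `NAME_ENTITY_COUNTS`, the entity descriptions in `ENTITY_CONTEXT`, and the mixing weight `alpha` are all made up for the example, and cosine similarity over bags of words stands in for whatever similarity model a real system would use.

```python
import math
from collections import Counter

# Hypothetical counts of how often the name "Ma" refers to each entity.
NAME_ENTITY_COUNTS = {
    "Ma": {"Yo-Yo_Ma": 60, "Jack_Ma": 30, "Massachusetts": 10},
}

# Hypothetical bag-of-words descriptions of the candidate entities.
ENTITY_CONTEXT = {
    "Yo-Yo_Ma": "cellist score classical album covered music",
    "Jack_Ma": "alibaba founder business china",
    "Massachusetts": "state boston united states",
}


def prior(name: str, entity: str) -> float:
    """Popularity prior: fraction of uses of the name that mean this entity."""
    counts = NAME_ENTITY_COUNTS[name]
    return counts[entity] / sum(counts.values())


def cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(name: str, mention_context: str, alpha: float = 0.5) -> list:
    """Rank candidates by a weighted mix of prior and context similarity."""
    scores = {
        entity: alpha * prior(name, entity)
        + (1 - alpha) * cosine(mention_context, ENTITY_CONTEXT[entity])
        for entity in NAME_ENTITY_COUNTS[name]
    }
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank("Ma", "Ma also covered the Ecstasy with his score")
```

Both signals treat each mention in isolation; they do not yet use the coherence between the entities chosen for different mentions.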

Slide15

Named Entity Recognition & Disambiguation (NERD)

Zhang played in Lee’s epic masterpiece, with Ma’s score. Ma also covered the Ecstasy.

Coherence of entity pairs:

semantic relationships

shared types (categories)

overlap of Wikipedia links

Slide16

Named Entity Recognition & Disambiguation (NERD)

Zhang played in Lee’s epic masterpiece, with Ma’s score. Ma also covered the Ecstasy.

Coherence: (partial) overlap of (statistically weighted) entity-specific keyphrases
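One way to read "partial overlap of statistically weighted keyphrases" is as a weighted Jaccard measure computed at the word level, so that two keyphrases sharing only some words still contribute. The sketch below makes that reading concrete. It is an assumption for illustration, not the measure used in the cited work; the weights and the grouping of keyphrases into three candidate entities (`kp_actress`, `kp_director`, `kp_western_score`) are hypothetical.

```python
import re


def word_weights(keyphrases: dict) -> dict:
    """Spread each keyphrase weight over its words, keeping the maximum per word."""
    weights = {}
    for phrase, weight in keyphrases.items():
        for word in re.findall(r"\w+", phrase.lower()):
            weights[word] = max(weights.get(word, 0.0), weight)
    return weights


def keyphrase_overlap(kp_a: dict, kp_b: dict) -> float:
    """Weighted Jaccard overlap of two keyphrase sets, compared word by word."""
    wa, wb = word_weights(kp_a), word_weights(kp_b)
    words = set(wa) | set(wb)
    shared = sum(min(wa.get(w, 0.0), wb.get(w, 0.0)) for w in words)
    total = sum(max(wa.get(w, 0.0), wb.get(w, 0.0)) for w in words)
    return shared / total if total else 0.0


# Keyphrases taken from the slide; weights and grouping are illustrative.
kp_actress = {"Crouching Tiger": 2.0, "Memoirs of a Geisha": 1.5, "Chinese Film Actress": 1.0}
kp_director = {"Crouching Tiger": 2.0, "Academy Award": 1.0, "Chinese American": 1.0,
               "Score by Tan Dun": 1.5}
kp_western_score = {"Good, Bad, Ugly": 2.0, "Western Movie Score": 1.5, "Ennio Morricone": 1.0}

related = keyphrase_overlap(kp_actress, kp_director)
unrelated = keyphrase_overlap(kp_actress, kp_western_score)
```

With these inputs, `related` is larger than `unrelated`, because the first two keyphrase sets share "Crouching Tiger" and "Chinese" while the actress and the western score share no words.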

Crouching Tiger

Memoirs of a Geisha

Chinese Film Actress

Grammy Award

Best Classical Album

Tan Dun Composition

Morricone Tribute

Good, Bad, Ugly

Western Movie Score

Ennio Morricone

Metallica cover song

Crouching Tiger

Academy Award

Chinese American

Score by Tan Dun

Slide17

Named Entity Recognition & Disambiguation

Zhang played in Lee’s epic masterpiece, with Ma’s score. Ma also covered the Ecstasy.

NED algorithms compute mention-to-entity mapping over weighted graph of candidates by popularity & similarity & coherence

KB provides building blocks: name-entity dictionary, relationships, types, text descriptions, keyphrases, statistics for weights

Slide18

Joint Mapping

Build mention-entity graph or probabilistic factor graph from knowledge and statistics in KB

Compute high-likelihood mapping (ML or MAP) or dense subgraph (with high total edge weight) such that: each m is connected to exactly one e (or at most one e)

[Figure: weighted graph over mentions m1 to m4 and candidate entities e1 to e6]

Slide19

Coherence Graph Algorithm

[Figure: weighted graph over mentions m1 to m4 and candidate entities e1 to e6, annotated with weighted degrees]

D5 Overview May 14, 2013

Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)

Approximation algorithms (greedy, randomized, …), hash sketches, …

82% precision on CoNLL’03 benchmark

Open-source software & online service AIDA

http://www.mpi-inf.mpg.de/yago-naga/aida/
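The greedy idea can be sketched as follows: repeatedly drop the candidate entity with the smallest weighted degree, but never the last remaining candidate of a mention. This is a simplified illustration under assumptions, not the AIDA implementation; the function name `greedy_disambiguate`, the input layout, and all edge weights are hypothetical.

```python
def greedy_disambiguate(mention_edges: dict, entity_edges: dict) -> dict:
    """Greedy sketch of dense-subgraph disambiguation.

    mention_edges: {mention: {entity: weight}} (popularity and similarity)
    entity_edges:  {frozenset({e1, e2}): weight} (coherence)
    """
    candidates = {m: set(es) for m, es in mention_edges.items()}
    alive = set().union(*candidates.values())

    def weighted_degree(entity: str) -> float:
        degree = sum(w[entity] for m, w in mention_edges.items() if entity in candidates[m])
        degree += sum(w for pair, w in entity_edges.items() if entity in pair and pair <= alive)
        return degree

    while True:
        # An entity may go only if every mention it serves keeps another candidate.
        removable = [e for e in alive
                     if all(len(c) > 1 for c in candidates.values() if e in c)]
        if not removable:
            break
        victim = min(removable, key=lambda e: (weighted_degree(e), e))
        alive.discard(victim)
        for c in candidates.values():
            c.discard(victim)

    # If a mention still has several candidates, keep the strongest edge.
    return {m: max(c, key=lambda e: (mention_edges[m][e], e)) for m, c in candidates.items()}


# Illustrative weights: "MDMA" is the more popular reading of "Ecstasy",
# but the coherence edge to Yo-Yo Ma favours the Morricone piece.
mention_edges = {
    "Ma": {"Yo-Yo_Ma": 50, "Jack_Ma": 30},
    "Ecstasy": {"Ecstasy_of_Gold": 30, "MDMA": 50},
}
entity_edges = {frozenset({"Yo-Yo_Ma", "Ecstasy_of_Gold"}): 90}

mapping = greedy_disambiguate(mention_edges, entity_edges)
```

In this toy graph the coherence edge outweighs the stronger mention edge of "MDMA", so both mentions are mapped jointly to the musically coherent pair.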

[J. Hoffart et al.: EMNLP’11, CIKM’12, WWW’14]

Slide20

NERD Online Tools

J. Hoffart et al.: EMNLP 2011, VLDB 2011
https://d5gate.ag5.mpi-sb.mpg.de/webaida/

P. Ferragina, U. Scaiella: CIKM 2010
http://tagme.di.unipi.it/

P.N. Mendes, C. Bizer et al.: I-Semantics 2011, WWW 2012
http://spotlight.dbpedia.org/demo/index.html

D. Milne, I. Witten: CIKM 2008
http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/

L. Ratinov, D. Roth, D. Downey, M. Anderson: ACL 2011
http://cogcomp.cs.illinois.edu/page/demo_view/Wikifier

D. Odijk, E. Meij, M. de Rijke: OAIR 2013
http://semanticize.uva.nl

Reuters Open Calais: http://viewer.opencalais.com/

Alchemy API: http://www.alchemyapi.com/api/demo.html

Slide21

NERD at Work

https://gate.d5.mpi-inf.mpg.de/webaida/

Slide22

Research Challenges & Opportunities

Handling long-tail and emerging entities

to complement and continuously update KB; key for KB life-cycle management

Entity name disambiguation in difficult situations

short and noisy texts about long-tail entities in social media

Efficient interactive & high-throughput batch NERD

a day’s news, a month’s publications, a decade’s archive

Web-scale entity linkage with high quality

across text sources, linked data, KBs, Web tables, …

Slide23

Outline

Introduction

Lovely NERD

The New Chocolate

The Dark Side

Conclusion

Slide24

Big Text: the New Chocolate

Slide25

Semantic Search over News

see also J. Hoffart et al.: demo @ CIKM’14

stics.mpi-inf.mpg.de

Slide26

Semantic Search over News

stics.mpi-inf.mpg.de

Slide27

Semantic Search over News

stics.mpi-inf.mpg.de

Slide28

Semantic Search over News

stics.mpi-inf.mpg.de

Slide29

Analytics over News

stics.mpi-inf.mpg.de/stats

Slide30

Machine Reading of Scholarly Papers

https://gate.d5.mpi-inf.mpg.de/knowlife/

[P. Ernst et al.: ICDE’14]

Slide31

Machine Reading of Health Forums

https://gate.d5.mpi-inf.mpg.de/knowlife/

[P. Ernst et al.: ICDE’14]

Slide32

Big Data & Text Analytics: Side Effects of Drug Combinations

http://dailymed.nlm.nih.gov

http://www.patient.co.uk

Deeper insight from both expert data & social media:

actual side effects of drugs … and drug combinations

risk factors and complications of (wide-spread) diseases

alternative therapies

aggregation & comparison by age, gender, life style, etc.

Structured Expert Data

Social Media

Slide33

Machine Reading (Semantic Parsing): from Names & Phrases to Entities, Classes & Relations

Ma played his version of the Ecstasy. The Maestro from Rome wrote scores for westerns.

Candidate entities and classes: Rome (Italy), Lazio Roma, AS Roma, Maestro Card, Ennio Morricone, MDMA, l’Estasi dell’Oro, Leonard Bernstein, Jack Ma, Yo-Yo Ma, Massachusetts, western movie, Western Digital

Candidate relations: cover of, story about, born in, plays for, film music, goal in football, plays sport, plays music

Slide34

Paraphrases of Relations

Dylan wrote a sad song Knockin’ on Heaven’s Door, a cover song by the Dead

Morricone’s masterpiece is the Ecstasy of Gold, covered by Yo-Yo Ma

Amy’s souly interpretation of Cupid, a classic piece of Sam Cooke

Nina Simone’s singing of Don’t Explain revived Holiday’s old song

Cat Power’s voice is haunting in her version of Don’t Explain

Cale performed Hallelujah written by L. Cohen

composed: musician, song

covered: musician, song

Sequence Mining with Type Lifting (N. Nakashole et al.: EMNLP’12, ACL’13, VLDB’12)

composed: wrote, classic piece of, ’s old song, written by, composed, …

350 000 SOL patterns from Wikipedia: http://www.mpi-inf.mpg.de/yago-naga/patty/

Relational phrases are typed: <singer> covered <song>, <book> covered <event>

SOL patterns over words, wildcards, POS tags, semantic types: <musician> wrote ADJ piece <song>
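To make the pattern notation concrete, the sketch below matches a pattern such as `<musician> wrote ADJ piece <song>` against a token sequence. It is a simplified illustration, not the PATTY implementation: the `Token` type, the assumed tag set `POS_TAGS`, the `*` wildcard standing for exactly one token, and the treatment of each entity as a single pre-typed token are all assumptions made for the example.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    """A token with its POS tag and, for entities, a semantic type."""
    text: str
    pos: str
    sem_type: Optional[str] = None


POS_TAGS = {"ADJ", "NOUN", "VERB", "DET", "PRON"}  # assumed tag set


def element_matches(element: str, token: Token) -> bool:
    """Match one pattern element: wildcard, <type>, POS tag, or literal word."""
    if element == "*":
        return True
    if element.startswith("<") and element.endswith(">"):
        return token.sem_type == element[1:-1]
    if element in POS_TAGS:
        return token.pos == element
    return token.text.lower() == element.lower()


def matches(pattern: str, tokens: List[Token]) -> bool:
    """True if the token sequence matches the pattern element by element."""
    elements = pattern.split()
    return len(elements) == len(tokens) and all(
        element_matches(e, t) for e, t in zip(elements, tokens)
    )


PATTERN = "<musician> wrote ADJ piece <song>"

sentence = [
    Token("Dylan", "NOUN", "musician"),
    Token("wrote", "VERB"),
    Token("sad", "ADJ"),
    Token("piece", "NOUN"),
    Token("Knockin' on Heaven's Door", "NOUN", "song"),
]

# Same words, but the subject is typed as a book, so the typed pattern fails.
wrong_type = [Token("Dylan", "NOUN", "book")] + sentence[1:]
```

The type check is what separates `<singer> covered <song>` from `<book> covered <event>`: identical surface words map to different relations depending on the argument types.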

Relational synsets (and subsumptions):

covered: cover song, interpretation of, singing of, voice in version, …

Slide35

Disambiguation for Entities, Classes & Relations

Phrases: wrote scores; scores for; westerns; from; Rome; Maestro

Candidate entities (e): Rome (Italy), Lazio Roma, MaestroCard, Ennio Morricone, Western Digital

Candidate classes (c): conductor, soundtrack, western movie, musician

Candidate relations (r): composed, soundtrackFor, shootsGoalFor, bornIn, actedIn, giveExam

weighted edges (coherence, similarity, etc.)

Combinatorial Optimization by ILP (with type constraints etc.)

ILP optimizers like Gurobi solve this in seconds

(M. Yahya et al.: EMNLP’12, CIKM’13)

Slide36

Research Challenges & Opportunities

Managing text & data on par: DB&IR integration, finally

text with SPO/ER markup linked to DB/KB records

DB/KB records linked to provenance, context, fact spottings

Query & analytics language – API?

SPARQL Full-Text, extended / relaxed SPARQL, data&text cube, …

User-friendly search & exploration – UI?

natural language, visual, multimodal, …

Context-aware “strings, things, cats”:

suggested entities & categories should consider input prefix

Comprehensive repository of relational paraphrases:

binary-predicate counterpart of WordNet – still elusive despite Patty, WiseNet, Probase, ReVerb, ReNoun, Biperpedia, …

Slide37

Research Challenges & Opportunities

Managing text & data on par: DB&IR integration, finally

text with SPO/ER markup linked to DB/KB records

DB/KB records linked to provenance, context, fact spottings

Query & analytics language – API?

SPARQL Full-Text, extended / relaxed SPARQL, data&text cube, …

User-friendly search & exploration – UI?

natural language, speech, visual, multimodal, …

Context-aware “strings, things, cats”:

suggested entities & categories should consider input prefix

Comprehensive repository of relational paraphrases:

binary-predicate counterpart of WordNet – still elusive despite Patty, WiseNet, Probase, ReVerb, ReNoun, Biperpedia, …

Slide38

Outline

Introduction

Lovely NERD

The New Chocolate

The Dark Side

Conclusion

Slide39

The Dark Side of Big Data

Nobody interested in your research? We read your papers!

Slide40

Entity Linking: Privacy at Stake

Zoe

Internet

search engine (search): Levothroid shaking; Addison’s disease; …; Nive concert; Greenland singers; Somalia elections; Steve Biko

social network (publish & recommend): female 29y Jamame; Nive Nielsen; Cry Freedom

online forum (discuss & seek help): female 25-30 Somalia; Synthroid tremble …; Addison disorder …

Slide41

Privacy Adversaries

Linkability Threats:

Weak cues: profiles, friends, etc.

Semantic cues: health, taste, queries

Statistical cues: correlations

Internet

search engine (search): Levothroid shaking; Addison’s disease; …; Nive concert; Greenland singers; Somalia elections; Steve Biko

social network (publish & recommend): female 29y Jamame; Nive Nielsen; Cry Freedom

online forum (discuss & seek help): female 25-30 Somalia; Synthroid tremble …; Addison disorder …

Slide42

Goal: Automated Privacy Advisor

Internet

search engine (search): Levothroid shaking; Addison’s disease; …; Nive concert; Greenland singers; Somalia elections

social network (publish & recommend): female 29y Jamame; Nive Nielsen; Cry Freedom

online forum (discuss & seek help): female 25-30 Somalia; Synthroid tremble …; Addison disorder …

Privacy Advisor (PA): software tool that

analyses risk

alerts user

advises user

explains consequences

recommends policy changes

Example alert: “Your queries may lead to linking your identities in Facebook and patient.co.uk! … Would you like to use an anonymization tool for your search requests? …”

ERC Project imPACT (M. Backes, P. Druschel, R. Majumdar, G. Weikum)

see also: J. Biega et al.: PSBD Workshop @ CIKM’14

Slide43

Research Challenges & Opportunities

Long-term privacy management: policies, risks, privacy-utility trade-offs

Explain risks, advise on consequences, recommend counter-measures and mitigation steps

Which data is (or can be linked to become) privacy-critical?

highly user-specific, but needs a global perspective

How are privacy risks building up over time?

where is my data, who has seen it, who can copy and accumulate it

Who are the adversaries? How powerful? At which cost?

role of background knowledge & statistical learning

Slide44

Outline

Introduction

Lovely NERD

The New Chocolate

The Dark Side

Conclusion

Slide45

Big Text & Big Data

Big Text & NERD: valuable content about entities lifted towards knowledge & analytic insight

Machine Reading: discover and interpret names & phrases as entities, classes, relations, spatio-temporal modifiers, sentiments, beliefs, …

Big Data: interlink natural-language text, social media, structured data & knowledge bases, images, videos

and help users cope with privacy risks

Slide46

Take-Home Message

Web Contents

Knowledge

knowledge acquisition

intelligent interpretation

more knowledge, analytics, insight

“Who Covered Whom?” and More! (Entities, Classes, Relations)