Structured Querying of Web Text: A Technical Challenge

Uploaded On 2018-01-04



Presentation Transcript

Slide1

Structured Querying of Web Text: A Technical Challenge

Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko

Presenter: Shahina Ferdous

ID – 1000630375

Date – 03/23/10

Slide2

Querying over Unstructured Data

Web (Text Documents)

The Web contains a vast amount of text documents, which are:

Unstructured

Accessed by keywords

Limited in search quality

Slide3

Querying over Unstructured Data

Web

Show me some people, what they invented, and the years they died

Keyword-in

Document-out

Slide4

Querying over Unstructured Data

Web

List some Scientists with their invention and the years they died

Keyword-in

Document-out

Slide5

Structured Querying of Web Text

“Show me some people, what they invented, and the years they died”

Scientist | Invention | Year | Prob
Kepler | log books | 1630 | .7902
Heisenberg | matrix mechanics | 1976 | .7897
Galileo | telescope | 1642 | .7395
Newton | calculus | 1727 | .7366

In this paper, the authors propose a structured Web query system called the extraction database, ExDB.

ExDB uses an information extraction (IE) system to extract data.

Because the extracted data can be erroneous, ExDB assigns a probability to each tuple.

Slide6

ExDB Work Flow

[Figure: ExDB workflow. Labeled components: Web, RDBMS, Query middleware. Steps: 1. Run extractors, 2. Populate data model, 3. Query Processing & Applications.]

Example Web text snippets:

…no one could surprising. In 1877, Edison invented the phonograph. Although he…

…didnt surprising. In 1877, Edison invented the phonograph. Although he…

…was surprising. In 1877, Edison invented the phonograph. Although he…

Facts
Obj1 | Pred | Obj2 | prob
Edison | invented | phonograph | 0.97
Morgan | born-in | 1837 | 0.85

Types
Type | Instance | prob
scientist | Einstein | 0.99
city | Seattle | 0.92

Synonyms
Pred1 | Pred2 | prob
invented | did-invent | 0.85
invented | created | 0.72

Example query: invented(Edison ?e, ?i)

Slide7

Information Extraction

ExDB extracts several base-level concepts through a combination of existing IE techniques:

Objects are data values in the system. Examples: Einstein, telephone, Boston, light-bulb.

Predicates represent binary relations between pairs of objects. Examples: discovered(Edison, phonograph), born-in(A. Einstein, Switzerland) and sells(Amazon, PlayStation).

Semantic types represent unary relations over objects. Examples: city(Boston), city(New-York) and electronics(dvd-player).

Slide8

Information Extraction

ExDB should also extract a further set of relationships to make queries easier for the user:

Synonyms denote equivalent objects, predicates or types. Examples: Einstein and A. Einstein almost certainly refer to the same object. Also, invented and has-invented refer to the same predicate.

Inclusion Dependencies describe a subset relationship between two predicates. Example: invented(?x, ?y) ⊆ discovered(?x, ?y).

Functional Dependencies are useful for answering queries with negation, or for explaining why an object is not an answer. For example, a probabilistic FD indicating that a person can be born in only one country:

born-in(?x, <country> ?y): ?x -> ?y p=0.95

Suppose a user asks for “All scientists born in Germany that taught at Princeton” and, after receiving the answers, asks the system “Why is Einstein not an answer?”. Using the FD above, the system can answer: born-in(Einstein, Switzerland) holds and the FD says a person can be born in only one country, therefore the probability of born-in(Einstein, Germany) is very low.

Slide9

Information Extraction

Example | Description | IE technique
invented(Edison, phonograph) | Arity-2 fact | TextRunner
<scientist> Einstein | Type (hypernymy) | KnowItAll
has-invented = invented | Synonymy | DIRT
invented ⊆ discovered | ID (troponymy) | ?
FD: has-capital(x, y) → has-capital(y) | FD (rule) | ?

Slide10

ExDB Work Flow

[Figure: the ExDB workflow diagram from Slide6 is repeated here, with the same example text, the same Facts, Types and Synonyms tables, and the same three steps: 1. Run extractors, 2. Populate data model, 3. Query Processing & Applications.]

Slide11

Populate Data Model

Facts
Obj1 | Pred | Obj2 | prob
Edison | invented | phonograph | 0.97
Morgan | born-in | 1837 | 0.85

Types
Type | Instance | prob
scientist | Einstein | 0.99
city | Boston | 0.92

Synonyms
Pred1 | Pred2 | prob
invented | did-invent | 0.85
invented | created | 0.72

IDs
Inclusion | Includer | prob
invented | discovered | 0.81
Seattle | Washington | 0.65

FDs
LHS | RHS | prob
capital(x, y) | capital(y) | 0.77
born-in(x) | country(y) | 0.95

Example source sentences:

It was big news when Edison invented the phonograph.

He visited cities such as Boston and New York.

We all know that Edison did-invent the light bulb.

In 1877 Edison created the phonograph.

Morgan was born-in 1837 into a prosperous mercantile-banking family.

Einstein is one of the best known scientists and intellectuals of all time.
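The relations on this slide could be loaded into an RDBMS roughly as follows. This is only a minimal sketch using Python's built-in sqlite3 module: the table names, column names and the helper inventions_of are assumptions made for illustration, not ExDB's actual schema. Only three of the five relations are shown.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative, not ExDB's.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facts    (obj1 TEXT, pred TEXT, obj2 TEXT, prob REAL);
CREATE TABLE types    (type_name TEXT, instance TEXT, prob REAL);
CREATE TABLE synonyms (pred1 TEXT, pred2 TEXT, prob REAL);
""")

# Rows copied from the Facts, Types and Synonyms tables on this slide.
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", [
    ("Edison", "invented", "phonograph", 0.97),
    ("Morgan", "born-in", "1837", 0.85),
])
conn.executemany("INSERT INTO types VALUES (?, ?, ?)", [
    ("scientist", "Einstein", 0.99),
    ("city", "Boston", 0.92),
])
conn.executemany("INSERT INTO synonyms VALUES (?, ?, ?)", [
    ("invented", "did-invent", 0.85),
    ("invented", "created", 0.72),
])


def inventions_of(person):
    """Return (invention, prob) pairs for one person, most probable first."""
    rows = conn.execute(
        "SELECT obj2, prob FROM facts "
        "WHERE obj1 = ? AND pred = 'invented' ORDER BY prob DESC",
        (person,),
    )
    return rows.fetchall()


print(inventions_of("Edison"))
```

Keeping the probability as an ordinary column means ranking by confidence is a plain ORDER BY, which fits the slide's picture of query middleware sitting above a standard RDBMS.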

For fact extraction, ExDB uses an unsupervised system called TextRunner.

TextRunner generates a large set of extractions while running over the entire corpus of text.

Unlike other IE systems, it does not require a set of target predicates to be specified beforehand. Instead, it starts by using a heavyweight linguistic parser to generate high-quality extraction triples.

It then uses these high-quality triples as the training set for a lightweight extraction classifier that can run on the entire Web-scale corpus.

For type extraction, ExDB uses the KnowItAll system.

KnowItAll searches the entire corpus to extract hypernym, or “is-a”, relationships. For example, it extracts city(Boston) from “cities such as Seattle and Boston”.

It assigns each extraction a probability based on its frequency (or search engine hit count).
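The “such as” pattern can be illustrated with a small regular-expression sketch. It is not KnowItAll's implementation: it handles only this one pattern, the singular() helper is a naive assumption, and the frequency-to-probability formula in score() (with its per-mention precision of 0.5) is a hypothetical stand-in for whatever model the real system uses.

```python
import re
from collections import Counter

# Matches "<plural type> such as <list of instances>" up to the next '.' or ';'.
SUCH_AS = re.compile(r"(\w+) such as ([^.;]+)")


def singular(word):
    """Naive singularization (assumption): cities -> city, scientists -> scientist."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s"):
        return word[:-1]
    return word


def extract_types(sentences):
    """Count how often each (type, instance) pair is seen in the sentences."""
    counts = Counter()
    for sentence in sentences:
        for match in SUCH_AS.finditer(sentence):
            type_name = singular(match.group(1).lower())
            # Split "A, B and C" into individual instances.
            for inst in re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(2)):
                inst = inst.strip()
                if inst:
                    counts[(type_name, inst)] += 1
    return counts


def score(count, per_mention_precision=0.5):
    """Hypothetical scoring: more independent mentions give higher confidence."""
    return 1.0 - (1.0 - per_mention_precision) ** count


counts = extract_types([
    "He visited cities such as Boston and New York.",
    "cities such as Seattle and Boston",
])
for (type_name, inst), n in sorted(counts.items()):
    print(f"{type_name}({inst}) seen {n}x, score {score(n):.2f}")
```

Boston appears in both sentences, so it ends up with a higher score than Seattle or New York, which mirrors the idea of assigning probability by frequency.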

For predicate synonyms, ExDB uses the DIRT algorithm. DIRT computes the degree to which the argument pairs of two predicates coincide. For example, invented and has-invented will share many argument pairs, such as Edison/light-bulb or Einstein/theory-of-relativity.
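The idea of comparing argument pairs can be sketched as below. Note the substitution: this sketch uses plain Jaccard overlap of (obj1, obj2) pairs as a simplified stand-in, whereas the published DIRT algorithm uses a mutual-information-weighted similarity. The function names and the sample facts are illustrative assumptions.

```python
from collections import defaultdict


def argument_pairs(facts):
    """Group (obj1, obj2) argument pairs by predicate."""
    pairs = defaultdict(set)
    for obj1, pred, obj2 in facts:
        pairs[pred].add((obj1, obj2))
    return pairs


def overlap(pairs, pred_a, pred_b):
    """Jaccard overlap of two predicates' argument pairs (simplified measure)."""
    a, b = pairs[pred_a], pairs[pred_b]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative facts, reusing objects mentioned on the slides.
facts = [
    ("Edison", "invented", "light-bulb"),
    ("Edison", "has-invented", "light-bulb"),
    ("Einstein", "invented", "theory-of-relativity"),
    ("Einstein", "has-invented", "theory-of-relativity"),
    ("Edison", "invented", "phonograph"),
    ("Dukakis", "visited", "Boston"),
]
pairs = argument_pairs(facts)
print(overlap(pairs, "invented", "has-invented"))
print(overlap(pairs, "invented", "visited"))
```

Predicates that share most of their argument pairs score close to 1 and are candidate synonyms; unrelated predicates score 0.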

Slide12

ExDB Work Flow

[Figure: the ExDB workflow diagram from Slide6 is repeated here, with the same example text, the same Facts, Types and Synonyms tables, and the same three steps: 1. Run extractors, 2. Populate data model, 3. Query Processing & Applications.]

Slide13

ExDB Queries

ExDB lets users query the Web data model using a Datalog-like notation.

Example: q(?i) :- invented(Edison, ?i) returns all inventions by Edison.

Example with constraints: q(?x, ?y) :- died-in(<scientist> ?x, 1955 ?y)

Example query for locally available, inexpensive electronics: q(?x, ?y, ?z) :- for-sale-in(<electronics> ?x, Seattle ?y), costs(?x, ?z), (?z < 25)

Another example: q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z), (?z < 1900)

Example of a projection query: q(?s) :- invented(<scientist> ?s, ?i)

Slide14

Query Processing

Non-projecting queries

Involve a series of joins against tables in the Web data model.

The probability of a joined tuple is the product of the individual tuples' probabilities.

The top-k answers, ranked by probability, are returned as results.

Types
Object | Class | prob
einstein | scientist | 0.99
boston | city | 0.98
bohr | scientist | 0.95
france | country | 0.92
curie | scientist | 0.91
Bugs bunny | scientist | 0.01

Facts
Object1 | Predicate | Object2 | prob
einstein | invented | relativity | 0.99
1848 | was-year-of | revolution | 0.97
edison | invented | phonograph | 0.96
dukakis | visited | boston | 0.93
einstein | died-in | 1955 | 0.92
humans | have | cold-fusion | 0.01
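The join-and-multiply evaluation described on this slide can be sketched in a few lines. This is a minimal illustration, not ExDB's query processor: the function name is hypothetical, the probabilities are the slide's values as best they can be read from the transcript, and the <year> type check on ?z is left out for brevity.

```python
# Types and Facts rows as read from this slide (probability assignment
# reconstructed from the transcript, so treat the values as approximate).
types = [
    ("einstein", "scientist", 0.99),
    ("bohr", "scientist", 0.95),
    ("curie", "scientist", 0.91),
    ("boston", "city", 0.98),
]
facts = [
    ("einstein", "invented", "relativity", 0.99),
    ("edison", "invented", "phonograph", 0.96),
    ("dukakis", "visited", "boston", 0.93),
    ("einstein", "died-in", "1955", 0.92),
]


def scientists_inventions_deaths(types, facts, k=3):
    """Evaluate q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, ?z)."""
    scientist_prob = {obj: p for obj, cls, p in types if cls == "scientist"}
    invented = [(o1, o2, p) for o1, pred, o2, p in facts if pred == "invented"]
    died_in = [(o1, o2, p) for o1, pred, o2, p in facts if pred == "died-in"]

    results = []
    for x, y, p_inv in invented:
        if x not in scientist_prob:
            continue  # ?x must be typed as a scientist
        for x2, z, p_died in died_in:
            if x2 == x:
                # A joined tuple's probability is the product of its parts.
                results.append((x, y, z, scientist_prob[x] * p_inv * p_died))

    # Return the k most probable answers.
    results.sort(key=lambda row: row[3], reverse=True)
    return results[:k]


for row in scientists_inventions_deaths(types, facts):
    print(row[:3], round(row[3], 2))
```

With these values the only answer is (einstein, relativity, 1955) with probability 0.99 × 0.99 × 0.92, which rounds to 0.90. Edison is dropped because no Types row marks him as a scientist.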

Example: q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z).

Scientist | Invented | Died-in | prob
einstein | relativity | 1955 | 0.90
…

Slide15

Projecting queries

q(?s) :- invented(<scientist> ?s, ?i) ranks scientists by the probability that the scientist invented something, without regard to the actual invention.

This requires computing a disjunction of m probabilistic events. A scientist such as Tesla appears in the output of q if a tuple invented(Tesla, I0) is in the database. There can be many inventions I1, …, Im for Tesla, each with a tuple invented(Tesla, Ii), and any one of them is sufficient to return Tesla as an answer for q.

Because m can be very large, a large number of very low-probability extractions can unexpectedly result in a quite large probability. The approach therefore tries to abstract a panel of experts, where an expert is a tuple with a score, such as invented(Tesla, Fluorescent-Lighting), 0.95, and the experts determine the probability of the scientist appearing in q.

Slide16

Result of Projecting Queries

q(?s) :- invented(<scientist> ?s, x)
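For a projecting query like this one, a scientist appears in the answer if at least one of that scientist's extracted inventions is true. A minimal sketch of that disjunction follows, assuming the extractions are independent events (an assumption made for illustration; the function name and the example probabilities are also hypothetical).

```python
import math


def prob_any(probs):
    """P(at least one event holds), assuming the events are independent."""
    return 1.0 - math.prod(1.0 - p for p in probs)


# One confident extraction versus many very weak ones.
one_strong = [0.95]
many_weak = [0.01] * 300

print(round(prob_any(one_strong), 3))
print(round(prob_any(many_weak), 3))
```

Under this naive disjunction, 300 extractions at probability 0.01 each score slightly higher than a single extraction at 0.95, which is the kind of inflation that motivates a more careful scoring scheme for projecting queries.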

Scientist invented

Slide17

ExDB Prototype

Web crawl: 90M pages

Facts: 338M tuples, 102M objects

Types: 6.6M instances

Synonyms: 17k pairs

No IDs or FDs yet

Slide18

Applications

ExDB's extracted data are not meant to be examined directly. Instead, they are used to build topic-specific tables that a human user can readily understand.

For example, a synthetic table about scientists can be generated by merging answers from died-in(<scientist> ?x, ?y), invented(<scientist> ?x, ?y), published(<scientist> ?x, ?y) and taught(<scientist> ?x, ?y).

If ExDB queries can be generated automatically from keywords, it becomes possible to build a very powerful query system.

It is also possible to build a Web data cube over ExDB's large amount of read-only structured data.

Slide19

Alternative Models

The Schema Extraction Model aims to find the single best schema for the entire set of extractions, transforming the Web text into a traditional relational database.

Three good criteria for schema extraction are:

Simplicity (few tables).

Completeness (all extractions appear in the output).

Fullness (the output database has no NULLs).

Slide20

Alternative Models

The Text Query Model does not perform any information extraction at all. Instead, it offers a descriptive query language that generates answers to a user's query very quickly.

User's query:

Extract city/date tuples from the band's website.

Indicate the city where she lives.

Compute the dates when the band's city and her own city are within 100 miles of each other.

Slide21

Questions?

Thank You