Michael J Cafarella Christopher Re Dan Suciu Oren Etzioni Michele Banko Presenter Shahina Ferdous ID 1000630375 Date 032310 Querying over Unstructured Data Web Text Documents ID: 619267
Slide1
Structured Querying of Web Text: A Technical Challenge
Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko
Presenter: Shahina Ferdous
ID – 1000630375
Date – 03/23/10
Slide2
Querying over Unstructured Data
Web (Text Documents)
Contains a vast amount of text documents, which are:
Unstructured
Accessed by keywords
Limited in search quality
Slide3
Querying over Unstructured Data
Web
Show me some people, what they invented, and the years they died
Keyword-in
Document-out
Slide4
Querying over Unstructured Data
Web
List some Scientists with their invention and the years they died
Keyword-in
Document-out
Slide5
Structured Querying of web Text
“Show me some people, what they invented, and the years they died”
Scientist    Invention          Year   Prob
Kepler       log books          1630   .7902
Heisenberg   matrix mechanics   1976   .7897
Galileo      telescope          1642   .7395
Newton       calculus           1727   .7366
In this paper, the authors propose a structured Web query system called the extraction database, ExDB.
ExDB uses information extraction (IE) systems to extract data.
Because the extracted data can be erroneous, ExDB assigns a probability to each tuple.
Slide6
ExDB Work Flow
Web text (sample): "… In 1877, Edison invented the phonograph. Although he …"
1. Run extractors
2. Populate data model (RDBMS)
Facts
Obj1     Pred       Obj2         prob
Edison   invented   phonograph   0.97
Morgan   born-in    1837         0.85
Types
Type        Instance   prob
scientist   Einstein   0.99
city        Seattle    0.92
Synonyms
Pred1      Pred2        prob
invented   did-invent   0.85
invented   created      0.72
3. Query Processing & Applications (query middleware)
Example query: invented(Edison ?e, ?i)
Slide7
Information Extraction
ExDB extracts several base-level concepts through a combination of existing IE techniques:
Objects are the data values in the system. Examples: Einstein, telephone, Boston, light-bulb.
Predicates represent binary relations between pairs of objects. Examples: discovered(Edison, phonograph), born-in(A.-Einstein, Switzerland), and sells(Amazon, PlayStation).
Semantic types represent unary relations over objects. Examples: city(Boston), city(New-York), and electronics(dvd-player).
Slide8
Information Extraction
ExDB should also extract a further series of relationships to make queries even easier for the user:
Synonyms denote equivalent objects, predicates, or types. Examples: Einstein and A.-Einstein almost certainly refer to the same object; invented and has-invented refer to the same predicate.
Inclusion Dependencies (IDs) describe a subset relationship between two predicates. Example: invented(?x, ?y) ⊆ discovered(?x, ?y).
Functional Dependencies (FDs) are useful for answering queries with negation, or for explaining why an object is not an answer. For example, a probabilistic FD indicating that a person can be born in only one country:
born-in(?x, <country> ?y): ?x -> ?y  p=0.95
Suppose a user asks for “All scientists born in Germany that taught at Princeton” and, after receiving the answers, asks the system “Why is Einstein not an answer?” Using the FD above, the system can answer: because born-in(Einstein, Switzerland) holds and the FD says a person can be born in only one country, the probability of born-in(Einstein, Germany) is very low.
Slide9
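The why-not explanation based on the born-in functional dependency can be sketched in a few lines of Python. This is only an illustration: the function name `explain_why_not`, the dictionary layout, the 0.93 fact probability, and the independence assumption behind the upper bound are all assumptions made here, not ExDB's actual mechanism.

```python
# Hypothetical sketch: using a probabilistic FD to explain a missing answer.
# The fact probability below is made up for illustration.

# born-in facts: (person, country) -> extraction probability
born_in = {("Einstein", "Switzerland"): 0.93}

FD_CONFIDENCE = 0.95  # born-in(?x, <country> ?y): ?x -> ?y, p=0.95 (from the slide)


def explain_why_not(person, country, facts, fd_conf):
    """Explain why born-in(person, country) is unlikely, or return None."""
    for (p, c), prob in facts.items():
        if p == person and c != country:
            # If the FD holds and the known fact is true, the queried fact is
            # false. Assuming independence, that bounds the queried fact by:
            upper = 1.0 - fd_conf * prob
            return (f"born-in({person}, {c}) holds with p={prob:.2f}; the FD "
                    f"?x -> ?y (p={fd_conf:.2f}) allows one country per person, "
                    f"so P(born-in({person}, {country})) <= {upper:.2f}")
    return None  # no conflicting fact, so the FD gives no explanation


print(explain_why_not("Einstein", "Germany", born_in, FD_CONFIDENCE))
```

The bound is deliberately crude; a real system would combine the FD with the full probabilistic model rather than a single conflicting fact.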
Information Extraction
Example                                   Description        IE technique
invented(Edison, phonograph)              Arity-2 fact       TextRunner
<scientist> Einstein                      Type (hypernymy)   KnowItAll
has-invented = invented                   Synonymy           DIRT
invented ⊆ discovered                     ID (troponymy)     ?
has-capital(x, y) -> has-capital(y)       FD (rule)          ?
Slide10
ExDB Work Flow
(Workflow diagram repeated from the earlier ExDB Work Flow slide: 1. Run extractors, 2. Populate data model, 3. Query Processing & Applications.)
Slide11
Populate Data Model
Facts
Obj1     Pred       Obj2         prob
Edison   invented   phonograph   0.97
Morgan   born-in    1837         0.85

Types
Type        Instance   prob
scientist   Einstein   0.99
city        Boston     0.92

Synonyms
Pred1      Pred2        prob
invented   did-invent   0.85
invented   created      0.72

IDs
Inclusion   Includer     prob
invented    discovered   0.81
Seattle     Washington   0.65

FDs
LHS             RHS          prob
capital(x, y)   capital(y)   0.77
born-in(x)      country(y)   0.95
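Since the data model lives in an RDBMS, three of these tables could be laid out roughly as follows. This is a sketch using Python's built-in `sqlite3` module; the table and column names are assumptions chosen to mirror the slide, not ExDB's actual schema.

```python
import sqlite3

# Hypothetical schema mirroring the slide's Facts, Types, and Synonyms tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facts    (obj1 TEXT, pred TEXT, obj2 TEXT, prob REAL);
    CREATE TABLE types    (type_name TEXT, instance TEXT, prob REAL);
    CREATE TABLE synonyms (pred1 TEXT, pred2 TEXT, prob REAL);
""")

# Rows copied from the slide's example tables.
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", [
    ("Edison", "invented", "phonograph", 0.97),
    ("Morgan", "born-in", "1837", 0.85),
])
conn.executemany("INSERT INTO types VALUES (?, ?, ?)", [
    ("scientist", "Einstein", 0.99),
    ("city", "Boston", 0.92),
])
conn.executemany("INSERT INTO synonyms VALUES (?, ?, ?)", [
    ("invented", "did-invent", 0.85),
    ("invented", "created", 0.72),
])

# invented(Edison, ?i): look up Edison's inventions, most probable first.
rows = conn.execute(
    "SELECT obj2, prob FROM facts WHERE obj1 = ? AND pred = ? ORDER BY prob DESC",
    ("Edison", "invented"),
).fetchall()
print(rows)
```

Keeping the probability as an ordinary column means ranking by confidence is a plain ORDER BY; the ID and FD tables would follow the same three-column pattern.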
It was big news when Edison invented the phonograph…
He visited cities such as Boston and New York.
We all know that Edison did-invent the light bulb.
…In 1877 Edison created the phonograph.
Morgan was born-in 1837 into a prosperous mercantile-banking family…
Einstein is one of the best known scientists and intellectuals of all time.
For fact extraction, ExDB uses an unsupervised system called TextRunner.
TextRunner generates a large set of extractions while running over the entire text corpus.
Unlike other IE systems, it does not require a set of target predicates to be specified beforehand. Instead, it starts by using a heavyweight linguistic parser to generate high-quality extraction triples.
These high-quality triples then serve as the training set for a lightweight extraction classifier that can run over the entire web-scale corpus.
TextRunner
For type extraction, ExDB uses the KnowItAll system.
KnowItAll searches the entire corpus to extract hypernym, or “is-a”, relationships. For example, it extracts city(Boston) from “cities such as Seattle and Boston”.
It assigns each extraction a probability based on its frequency (or search engine hit count).
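The “such as” style of type extraction can be illustrated with a small pattern matcher. This is a toy sketch, not KnowItAll itself: the regular expression, the naive `singular` helper, and the use of raw counts (rather than a calibrated probability) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Crude pattern for "<plural noun> such as A, B and C" (illustrative assumption).
PATTERN = re.compile(r"(\w+) such as ((?:[A-Z]\w*(?: [A-Z]\w*)*(?:, | and )?)+)")


def singular(noun):
    """Very naive singularization: cities -> city, scientists -> scientist."""
    if noun.endswith("ies"):
        return noun[:-3] + "y"
    return noun[:-1] if noun.endswith("s") else noun


def extract_types(sentences):
    """Count (type, instance) pairs found via the 'such as' pattern."""
    counts = Counter()
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            type_name = singular(match.group(1).lower())
            for instance in re.split(r", | and ", match.group(2)):
                if instance:  # skip the empty piece left by a trailing separator
                    counts[(type_name, instance)] += 1
    return counts


counts = extract_types([
    "He visited cities such as Boston and New York.",
    "cities such as Seattle and Boston",
])
print(counts)
```

An instance seen in more sentences gets a higher count, which is the raw signal a frequency-based probability would be derived from.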
KnowItAll
ExDB uses the DIRT algorithm to extract predicate synonyms. DIRT computes the degree to which the argument pairs of two predicates coincide. For example, invented and has-invented will share many argument pairs, such as Edison/light-bulb or Einstein/theory-of-relativity.
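The idea of comparing argument pairs can be illustrated with a simple set-overlap score. Note that this sketch uses plain Jaccard overlap as a stand-in; the published DIRT algorithm uses a more elaborate similarity measure that is not reproduced here, and the triples below are made up.

```python
from collections import defaultdict

# Illustrative (made-up) extracted triples: (obj1, predicate, obj2).
triples = [
    ("Edison", "invented", "light-bulb"),
    ("Edison", "has-invented", "light-bulb"),
    ("Einstein", "invented", "theory-of-relativity"),
    ("Einstein", "has-invented", "theory-of-relativity"),
    ("Edison", "invented", "phonograph"),
    ("Dukakis", "visited", "Boston"),
]


def argument_pairs(triples):
    """Map each predicate to the set of (obj1, obj2) pairs it was seen with."""
    pairs = defaultdict(set)
    for obj1, pred, obj2 in triples:
        pairs[pred].add((obj1, obj2))
    return pairs


def overlap(pred_a, pred_b, pairs):
    """Jaccard overlap of two predicates' argument-pair sets (a stand-in score)."""
    a, b = pairs[pred_a], pairs[pred_b]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


pairs = argument_pairs(triples)
print(overlap("invented", "has-invented", pairs))  # shares 2 of 3 pairs
print(overlap("invented", "visited", pairs))       # shares no pairs
```

Predicate pairs whose score exceeds some threshold would become candidate rows in the Synonyms table.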
DIRT
Slide12
ExDB Work Flow
(Workflow diagram repeated from the earlier ExDB Work Flow slide: 1. Run extractors, 2. Populate data model, 3. Query Processing & Applications.)
Slide13
ExDB Queries
ExDB lets users query the web data model using a Datalog-like notation.
Example: q(?i) :- invented(Edison, ?i) returns all inventions by Edison.
Example with constraints: q(?x, ?y) :- died-in(<scientist> ?x, 1955 ?y)
Example query for locally available inexpensive electronics: q(?x, ?y, ?z) :- for-sale-in(<electronics> ?x, Seattle ?y), costs(?x, ?z), (?z < 25)
Another example: q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z), (?z < 1900)
Example of a projecting query: q(?s) :- invented(<scientist> ?s, ?i)
Slide14
Query Processing
Non-projecting queries
Involve a series of joins against tables in the web data model.
The probability of a joined tuple is the product of the individual tuples' probabilities.
The top-k answers, ranked by probability, are returned as results.
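The join, product, and top-k steps can be sketched as follows. This is an illustrative in-memory version, not ExDB's query middleware: the function name is hypothetical, the tuples are a small sample in the style of this slide's example tables, and the <year> type constraint is left out for brevity.

```python
import heapq

# Illustrative tuples with extraction probabilities.
types = {("einstein", "scientist"): 0.99, ("bohr", "scientist"): 0.95,
         ("boston", "city"): 0.98}
facts = [("einstein", "invented", "relativity", 0.99),
         ("einstein", "died-in", "1955", 0.92),
         ("edison", "invented", "phonograph", 0.96),
         ("dukakis", "visited", "boston", 0.93)]


def scientists_inventions_deaths(types, facts, k=10):
    """Evaluate q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, ?z).

    The <year> constraint on ?z is omitted to keep the sketch short.
    """
    answers = []
    for x, pred1, y, p_inv in facts:
        if pred1 != "invented":
            continue
        p_type = types.get((x, "scientist"))
        if p_type is None:
            continue  # ?x is not a known scientist in this sample
        for x2, pred2, z, p_died in facts:
            if pred2 == "died-in" and x2 == x:
                # Joined tuple probability = product of the joined tuples' probabilities.
                answers.append((p_type * p_inv * p_died, x, y, z))
    return heapq.nlargest(k, answers)  # top-k answers by probability


top = scientists_inventions_deaths(types, facts, k=3)
for prob, x, y, z in top:
    print(f"{x} {y} {z} {prob:.2f}")
```

In a real deployment the joins would be pushed down to the RDBMS; the nested loops here only make the product rule explicit.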
Types
Object       Class       prob
einstein     scientist   0.99
boston       city        0.98
bohr         scientist   0.95
france       country     0.92
curie        scientist   0.91
…
Bugs bunny   scientist   0.01

Facts
Object1    Predicate     Object2       prob
einstein   invented      relativity    0.99
1848       Was-year-of   revolution    0.97
edison     invented      phonograph    0.96
dukakis    visited       boston        0.93
einstein   died-in       1955          0.92
…
humans     have          Cold-fusion   0.01

Example: q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z).
Scientist   Invented     Died-in   prob
einstein    relativity   1955      0.90
…
Slide15
Projecting queries
q(?s) :- invented(<scientist> ?s, ?i) ranks scientists by the probability that the scientist invented something, regardless of the actual invention.
This requires computing a disjunction of m probabilistic events. A scientist such as Tesla appears in the output of q if any tuple invented(Tesla, Ii) is in the database. There can be many inventions I1, …, Im for Tesla, and any one of them is sufficient to return Tesla as an answer for q.
Because m can be very large, a large number of very low-probability extractions can unexpectedly add up to a quite large probability.
Therefore, the approach is to abstract the tuples as a panel of experts, where an expert is a tuple with a score, such as invented(Tesla, Fluorescent-Lighting), 0.95, and the experts determine the probability of Tesla appearing in q.
Slide16
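To see why a disjunction over many low-probability extractions can become large, assume purely for illustration that the m events are independent, so that P(at least one is true) = 1 - (1 - p1)(1 - p2)…(1 - pm). The function name and the junk-extraction numbers below are hypothetical; this is the naive computation whose weakness motivates the panel-of-experts view, not ExDB's scoring method.

```python
import math


def disjunction_probability(probs):
    """P(at least one event is true), assuming the events are independent."""
    return 1.0 - math.prod(1.0 - p for p in probs)


# One confident extraction, e.g. invented(Tesla, Fluorescent-Lighting) with score 0.95.
confident = disjunction_probability([0.95])

# 200 hypothetical junk extractions, each with probability only 0.01.
junk = disjunction_probability([0.01] * 200)

print(f"one confident extraction: {confident:.3f}")
print(f"200 junk extractions:     {junk:.3f}")
```

Under the independence assumption, the 200 junk extractions together score about 0.87, close to the single confident extraction, even though none of them is individually credible.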
Result of Projecting Queries
q(?s) :- invented(<scientist> ?s, x)
Scientist invented
Slide17
ExDB Prototype
Web crawl: 90M pages
Facts: 338M tuples, 102M objects
Types: 6.6M instances
Synonyms: 17k pairs
No IDs or FDs yet
Slide18
Applications
ExDB's extracted data are not meant to be examined directly; rather, they are used to build topic-specific tables that a human user can readily appreciate.
For example, a synthetic table about scientists can be generated by merging the answers from died-in(<scientist> ?x, ?y), invented(<scientist> ?x, ?y), published(<scientist> ?x, ?y), and taught(<scientist> ?x, ?y).
If ExDB queries can be generated automatically from keywords, a very powerful query system can be built.
A web data cube can be built over ExDB's large amount of read-only structured data.
Slide19
Alternative Models
The Schema Extraction Model aims to find the single best schema for the entire set of extractions, transforming the web text into a traditional relational database.
Three good criteria for schema extraction are:
Simplicity (few tables).
Completeness (all extractions appear in the output).
Fullness (the output database has no NULLs).
Slide20
Alternative Models
The Text Query Model does not perform any information extraction at all; instead, it offers a descriptive query language that generates answers to a user's query very quickly.
User's query:
Extract city/date tuples from the band's website.
Indicate the city where she lives.
Compute the dates when the band's city and her own city are within 100 miles of each other.
Slide21
Questions?
Thank You