WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
Outline
Motivation
Previous Work
Our Approach
Evaluation
Applications
Conclusion
Motivation: Acquiring concept-instance pairs
NLP tasks: summarization, co-reference resolution, named entity extraction
Knowledge bases (e.g. NELL, Freebase): limited coverage
Can we use the relational data in tables on the web?
Previous Work
Co-ordinate terms (same concept): Pittsburgh, New York; USA, India
Hyponym patterns: X "such as" Y, e.g. "cities such as Pittsburgh"
Van Durme and Pasca (AAAI'08): distributional similarity plus Hearst patterns, producing candidate class-instance pairs
Shinzato and Torisawa (NAACL'04): uses itemizations in webpages; queries the Web to find candidate hypernyms for entities
This work: table columns plus Hearst patterns over the ClueWeb09 corpus; open-domain, unsupervised, and corpus-based
Our Approach
Hypothesis 1: Entities appearing in a table column probably belong to the same concept.
Hypothesis 2: If a set of entities co-occurs frequently in multiple table columns coming from multiple URL domains, then with high probability they represent some meaningful concept.
Two steps: relational table identification, then entity clustering.
Relational Table Identification
Heuristic-based table classifier. Example input:

Country | Capital City
India   | Delhi
China   | Beijing
Canada  | Ottawa
France  | Paris

Features: #rows, #columns, #non-link columns, length of cells, whether the table is recursive (nests another table).
In the raw corpus roughly 20% of tables are relational; among the tables the classifier keeps, roughly 65% are relational.
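The slide lists the classifier's features but not its decision rule. Below is a minimal sketch that computes those features and applies an illustrative rule-based filter; the function names, thresholds, and rule combination are assumptions for illustration, not the paper's trained classifier.

```python
# Sketch of the relational-table filter described on this slide.
# The feature names come from the slide; the thresholds in
# looks_relational() are illustrative assumptions.

def table_features(rows, link_flags, contains_table):
    """rows: list of rows, each a list of cell strings;
    link_flags[j]: True if column j consists entirely of hyperlinks;
    contains_table: True if the table nests another table (recursive)."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    n_nonlink = sum(1 for f in link_flags if not f)
    avg_cell_len = (sum(len(c) for r in rows for c in r) /
                    max(1, n_rows * n_cols))
    return {"rows": n_rows, "cols": n_cols, "nonlink_cols": n_nonlink,
            "avg_cell_len": avg_cell_len, "recursive": contains_table}

def looks_relational(feats):
    # Illustrative rule: enough rows and columns, short cells,
    # at least one non-link column, no nested tables.
    return (feats["rows"] >= 2 and feats["cols"] >= 2 and
            feats["nonlink_cols"] >= 1 and
            feats["avg_cell_len"] < 50 and not feats["recursive"])

rows = [["India", "Delhi"], ["China", "Beijing"],
        ["Canada", "Ottawa"], ["France", "Paris"]]
print(looks_relational(table_features(rows, [False, False], False)))  # True
```

A navigation menu rendered as a one-column table of links, or a layout table that nests other tables, would fail this filter.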
Data Representation
Every table cell is a potential entity.
Every table column is a potential entity set.
No labeled data; only co-occurrence information.
Clustering the table columns is important: some columns are relevant only to the page they appear in, and there are many overlapping columns.
Triplet Store Representation
HTML tables:

TableId=21, domain="dom1.com"      TableId=34, domain="dom2.com"
Country | Capital City             Country | Capital
India   | Delhi                    Canada  | Ottawa
China   | Beijing                  China   | Beijing
Canada  | Ottawa                   France  | Paris
France  | Paris                    England | London

Entity-feature file (triplet store):

Entities                | Table:Column | Domains
Canada, China, India    | 21:1         | dom1.com
Canada, China, France   | 21:1, 34:1   | dom1.com, dom2.com
Beijing, Delhi, Ottawa  | 21:2         | dom1.com
Beijing, Ottawa, Paris  | 21:2, 34:2   | dom1.com, dom2.com
Canada, England, France | 34:1         | dom2.com
London, Ottawa, Paris   | 34:2         | dom2.com

Size(table corpus) = O(N); size(triplet store) = O(N).
Clustering Algorithm
The number of clusters is unknown, so parametric algorithms are not useful.
Scalability matters for a web-scale corpus: a naive single-link clustering algorithm builds the whole similarity matrix, which is O(N^2).
Instead: a single-pass algorithm, similar in spirit to the Chinese Restaurant Process.
Complexity: O(N log N).
Bottom-up Entity Clustering (walkthrough)
Triplets are sorted in descending order of #domains:

Entities                | Table:Column | Domains
Beijing, Ottawa, Paris  | 21:2, 34:2   | dom1.com, dom2.com
China, Canada, France   | 21:1, 34:1   | dom1.com, dom2.com
Delhi, Beijing, Ottawa  | 21:2         | dom1.com
Canada, England, France | 34:1         | dom2.com
London, Ottawa, Paris   | 34:2         | dom2.com
India, China, Canada    | 21:1         | dom1.com

Starting with #clusters = 0, each triplet is either merged into an overlapping existing cluster or starts a new one:
1. {Beijing, Ottawa, Paris} starts Cluster-1.
2. {China, Canada, France} overlaps no cluster, so it starts Cluster-2.
3. {Delhi, Beijing, Ottawa} overlaps Cluster-1: Cluster-1 = {Beijing, Delhi, Ottawa, Paris}.
4. {Canada, England, France} overlaps Cluster-2: Cluster-2 = {Canada, China, England, France}.
5. {London, Ottawa, Paris} overlaps Cluster-1: Cluster-1 = {Beijing, Delhi, London, Ottawa, Paris}.
6. {India, China, Canada} overlaps Cluster-2: Cluster-2 = {Canada, China, England, France, India}.

The algorithm is order dependent, and finding the optimal ordering is NP-hard.
Sub-optimal solution: process triplets in descending order of #domains.
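The single-pass walkthrough above can be sketched in a few lines. The merge criterion here ("shares at least two entities with a cluster") is a simplified stand-in for the paper's overlap-similarity test, which also considers shared table:columns; the function name and threshold are assumptions.

```python
def cluster_triplets(triplets, min_overlap=2):
    """triplets: list of (entity_set, column_ids, domains), assumed
    pre-sorted by number of domains, descending.  A triplet is merged
    into the first cluster sharing at least min_overlap entities;
    otherwise it starts a new cluster.  This overlap rule is a
    simplification of the actual similarity threshold."""
    clusters = []  # each cluster: {"entities", "cols", "doms"} sets
    for ents, cols, doms in triplets:
        for c in clusters:
            if len(c["entities"] & ents) >= min_overlap:
                c["entities"] |= ents
                c["cols"] |= cols
                c["doms"] |= doms
                break
        else:
            clusters.append({"entities": set(ents), "cols": set(cols),
                             "doms": set(doms)})
    return clusters

triplets = [  # sorted by #domains, as on the slides
    ({"Beijing", "Ottawa", "Paris"}, {"21:2", "34:2"}, {"dom1.com", "dom2.com"}),
    ({"China", "Canada", "France"}, {"21:1", "34:1"}, {"dom1.com", "dom2.com"}),
    ({"Delhi", "Beijing", "Ottawa"}, {"21:2"}, {"dom1.com"}),
    ({"Canada", "England", "France"}, {"34:1"}, {"dom2.com"}),
    ({"London", "Ottawa", "Paris"}, {"34:2"}, {"dom2.com"}),
    ({"India", "China", "Canada"}, {"21:1"}, {"dom1.com"}),
]
for c in cluster_triplets(triplets):
    print(sorted(c["entities"]))
# ['Beijing', 'Delhi', 'London', 'Ottawa', 'Paris']
# ['Canada', 'China', 'England', 'France', 'India']
```

Run on the six example triplets, this reproduces the two clusters of the walkthrough: the cities end up in Cluster-1 and the countries in Cluster-2.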
Hypernym Recommendation
Hyponym-pattern data (entity -> hypernym : count):

Entity | Hypernym : count
China  | Country : 1000
Canada | Country : 300
Delhi  | City : 450, Destination : 10
London | City : 500

Entity clusters from the previous step:

Entities                              | Table:Column | Domains
India, China, Canada, France, England | 21:1, 34:1   | dom1.com, dom2.com
Delhi, Beijing, Ottawa, London, Paris | 21:2, 34:2   | dom1.com, dom2.com

Recommended hypernyms per cluster:

Hypernym                  | Entities
Country : 2               | India, China, Canada, France, England
City : 2, Destination : 1 | Delhi, Beijing, Ottawa, London, Paris

Score(hypernym | cluster) = f(co-occurrence counts of the hypernym with entities in the cluster).
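One simple instance of the scoring function f above, matching the counts shown on the slide (Country : 2, City : 2, Destination : 1), is to score each hypernym by the number of cluster entities it co-occurs with in the hyponym-pattern data. This is a sketch of that choice; other functions of the co-occurrence counts are possible.

```python
from collections import Counter

def rank_hypernyms(cluster_entities, hyponym_data):
    """hyponym_data: entity -> {hypernym: count}, mined with Hearst
    ('X such as Y') patterns.  Scores each hypernym by the number of
    cluster entities it co-occurs with, then ranks descending."""
    scores = Counter()
    for ent in cluster_entities:
        for hyp in hyponym_data.get(ent, {}):
            scores[hyp] += 1
    return scores.most_common()

# Hyponym-pattern counts from this slide.
hyponym_data = {
    "China": {"Country": 1000},
    "Canada": {"Country": 300},
    "Delhi": {"City": 450, "Destination": 10},
    "London": {"City": 500},
}
print(rank_hypernyms({"Delhi", "Beijing", "Ottawa", "London", "Paris"},
                     hyponym_data))
# [('City', 2), ('Destination', 1)]
```

Scoring by entity coverage rather than raw pattern counts keeps one very frequent entity (e.g. China with Country : 1000) from dominating the label for the whole cluster.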
WebSets Framework
HTML documents -> Relational Table Identification -> Triplet store -> Bottom-up Clustering -> Entity clusters
Entity clusters + Hyponym Pattern Data -> Hypernym Recommendation -> Labeled entity sets <Entities, hypernym>
Datasets

Dataset          | Description                        | #HTML pages | #Tables
Toy_Apple        | Fruits + companies                 | 574         | 2.6K
Delicious_Sports | Links from Delicious w/ tag=sports | 21K         | 146.3K
Delicious_Music  | Links from Delicious w/ tag=music  | 183K        | 643.3K
CSEAL_Useful     | Pages SEAL found NELL entities on  | 30K         | 322.8K
ASIA_NELL        | ASIA run on NELL categories        | 112K        | 676.9K
ASIA_INT         | ASIA run on intelligence domain    | 121K        | 621.3K
Clueweb_HPR      | High-pagerank sample of ClueWeb    | 100K        | 586.9K
Evaluation: Clustering
Triplet record vs. entity record (Toy_Apple): triplets have an entity-disambiguation effect.
There is a precision-recall tradeoff in hypernym recommendation.
WebSets vs. K-Means:

Dataset          | Method  | K  | Purity | NMI  | RI
Toy_Apple        | K-Means | 40 | 0.96   | 0.71 | 0.98
Toy_Apple        | WebSets | 25 | 0.99   | 0.99 | 1.00
Delicious_Sports | K-Means | 50 | 0.72   | 0.68 | 0.98
Delicious_Sports | WebSets | 32 | 0.83   | 0.64 | 1.00

Ideal #clusters: Toy_Apple: 27; Delicious_Sports: 29.
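For reference, the purity measure reported in the table can be computed as below: each cluster is credited with its majority gold class, and purity is the fraction of items so credited. The gold-label mapping is a hypothetical toy example.

```python
from collections import Counter

def purity(clusters, gold):
    """clusters: list of sets of items; gold: item -> true class.
    Purity = fraction of items assigned to their cluster's
    majority class."""
    correct = 0
    total = 0
    for c in clusters:
        labels = Counter(gold[x] for x in c)
        correct += labels.most_common(1)[0][1]
        total += len(c)
    return correct / total

# Toy gold labels in the spirit of the Toy_Apple dataset.
gold = {"apple": "fruit", "mango": "fruit", "banana": "fruit",
        "google": "company", "microsoft": "company"}
print(purity([{"apple", "mango", "google"}, {"banana", "microsoft"}], gold))
# 0.6
```

Purity alone rewards many tiny clusters, which is why the table pairs it with NMI and the Rand index (RI).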
Evaluation of Entity Sets

Dataset      | #Triplets | #Clusters | #Clusters w/ hypernyms | %Meaningful | MRR of hypernym | Precision
CSEAL_Useful | 165.2K    | 1090      | 312                    | 69.0        | 0.56            | 98.6%
ASIA_NELL    | 11.4K     | 448       | 266                    | 73.0        | 0.59            | 98.5%
ASIA_INT     | 15.1K     | 395       | 218                    | 63.0        | 0.58            | 97.4%
Clueweb_HPR  | 516.0     | 47        | 34                     | 70.5        | 0.56            | 99.0%

Methodology:
Uniformly sample 100 clusters per dataset; manual evaluation of each cluster and its ranked list of hypernyms.
The evaluator decides whether a cluster is meaningful and picks the most specific hypernym from the list. Noisy clusters are, e.g., navigational/menu items.
MRR is computed against the ranked list of hypernyms per cluster.
Precision(cluster, hypernym) = #entities belonging to the hypernym / size(cluster), computed via an Amazon Mechanical Turk evaluation asking "Is X of type Y?" (X: entity, Y: hypernym).
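The two evaluation formulas above are straightforward to state in code. The judgment function stands in for the Mechanical Turk "Is X of type Y?" answers; its name and the example labels are illustrative.

```python
def cluster_precision(cluster, is_of_type):
    """is_of_type: entity -> True if the entity belongs to the
    cluster's assigned hypernym (e.g. a Mechanical Turk judgment).
    Precision = #entities of that type / size(cluster)."""
    return sum(1 for e in cluster if is_of_type(e)) / len(cluster)

def mean_reciprocal_rank(ranked_lists, correct):
    """ranked_lists: per-cluster ranked hypernym lists;
    correct: the label the evaluator chose for each cluster.
    MRR = mean of 1/rank of the correct label (0 if absent)."""
    total = 0.0
    for hyps, gold in zip(ranked_lists, correct):
        if gold in hyps:
            total += 1.0 / (hyps.index(gold) + 1)
    return total / len(ranked_lists)

print(mean_reciprocal_rank([["Country", "Place"], ["Destination", "City"]],
                           ["Country", "City"]))  # (1/1 + 1/2) / 2 = 0.75
```

An MRR near 0.56-0.59, as in the table, roughly means the correct hypernym sits around rank 2 on average.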
Application: Corpus Summary (Music domain)
Instruments: Flute, Tuba, String Orchestra, Chimes, Harmonium, Bassoon, Woodwinds, Glockenspiel, French horn, Timpani, Piano
Intervals: Whole tone, Major sixth, Fifth, Perfect fifth, Seventh, Third, Diminished fifth, Whole step, Fourth, Minor seventh, Major third, Minor third
Genres: Smooth jazz, Gothic, Metal rock, Rock, Pop, Hip hop, Rock n roll, Country, Folk, Punk rock
Audio equipment: Audio editor, General midi synthesizer, Audio recorder, Multichannel digital audio workstation, Drum sequencer, Mixers, Music engraving system, Audio server, Mastering software, Soundfont sample player

The extracted concepts are very specific to the domain, and the hypernyms are generated automatically.
Conclusions & Future Work
We proposed an unsupervised information extraction method that extracts sets of entities from HTML tables on the web.
The clustering algorithm is efficient, O(N log N), and achieves high cluster purity (0.83-0.99).
Unsupervised hypernym recommendation reaches intra-cluster precision of 97-99%.
Future work: extend this technique to the unsupervised extraction and naming of relations.
Thank You!
Evaluation: Hypernym Recommendation
WebSets (WS) vs. the method of Van Durme and Pasca (DPM), AAAI'08:

Method | K   | J   | %Accuracy | Yield (#pairs produced) | #Correct pairs (predicted)
DPM    | Inf | 0.0 | 34.6      | 88.6K                   | 30.7K
DPM    | 5   | 0.2 | 50.0      | 0.8K                    | 0.4K
DPMExt | Inf | 0.0 | 21.9      | 100,828.0K              | 22,081.3K
DPMExt | 5   | 0.2 | 44.0      | 2.8K                    | 1.2K
WS     | -   | -   | 67.7      | 73.7K                   | 45.8K
WSExt  | -   | -   | 78.8      | 64.8K                   | 51.1K

DPM: outputs a concept-instance pair only if it appears in the Hyponym Concept Dataset (HCD), the concept is assigned to fewer than K clusters, and the fraction of entities in the cluster that match the concept exceeds J.
DPMExt: outputs a concept-instance pair even if it is absent from the HCD.
WS: ranks the hypernyms of a cluster by the number of entities they match.
WSExt: additionally filters the Hyponym Concept Dataset by a threshold on count.
More statistics about the datasets

Dataset          | #Tables | %Relational | #Filtered tables | %Relational (filtered) | #Triplets
Toy_Apple        | 2.6K    | 50          | 762              | 75                     | 15K
Delicious_Sports | 146.3K  | 15          | 57.0K            | 55                     | 63K
Delicious_Music  | 643.3K  | 20          | 201.8K           | 75                     | 93K
CSEAL_Useful     | 322.8K  | 30          | 116.2K           | 80                     | 1148K
ASIA_NELL        | 676.9K  | 20          | 233.0K           | 55                     | 421K
ASIA_INT         | 621.3K  | 15          | 216.0K           | 60                     | 374K
Clueweb_HPR      | 586.9K  | 10          | 176.0K           | 35                     | 78K
Triplet record vs. entity record (on the Toy_Apple dataset)

Method  | FM w/ entity records | FM w/ triplet records
WebSets | 0.11 (K=25)          | 0.85 (K=34)
K-Means | 0.09 (K=30)          | 0.35 (K=30)
K-Means | 0.08 (K=25)          | 0.38 (K=25)

Triplet records can separate "Apple as fruit" from "Apple as company".
Application: Contributions to the existing NELL KB

Shard | #classes | #pairs | %acc (1-20) | %acc (1-100) | %acc (1-500) | %acc (1-1000) | %acc (1-10000)
100K  | 13       | 944    | 100.00      | 80.56        | 81.82        | 72.97         | 65.93
200K  | 3        | 542    | 100.00      | 86.49        | 77.78        | 75.00         | 76.40
300K  | 24       | 2236   | 100.00      | 89.74        | 79.31        | 72.72         | 67.71
400K  | 44       | 4456   | 93.75       | 80.65        | 80.85        | 74.63         | 67.86
500K  | 34       | 9194   | 100.00      | 83.78        | 85.96        | 77.03         | 73.40

Entity sets produced by WebSets are mapped to NELL categories: entities in each cluster are matched to contexts in np-context data from ClueWeb, and each context is mapped to NELL categories (using the useful patterns for each NELL category).
score(NELL concept) = sum of the TFIDF scores of its contexts.
The TFIDF score of a context depends on the number of entities in the cluster it matches (generality) and the number of concepts it is associated with (specificity).
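The mapping score described in the last bullets can be sketched as below. The slide gives only the shape of the score (sum of TFIDF-style weights, with generality in the numerator and specificity as a penalty); the exact weighting here (entity hits divided by a log of the concept count) and all names are assumptions for illustration.

```python
import math

def score_nell_concepts(context_entity_hits, context_concepts):
    """context_entity_hits: context -> #entities in the cluster that
    occur with it (generality); context_concepts: context -> set of
    NELL concepts the context signals (its size is the specificity
    penalty).  The hits / log2(1 + #concepts) weighting is an
    illustrative assumption, not the paper's exact formula."""
    scores = {}
    for ctx, hits in context_entity_hits.items():
        concepts = context_concepts.get(ctx, set())
        if not concepts:
            continue
        weight = hits / math.log2(1 + len(concepts))
        for concept in concepts:
            scores[concept] = scores.get(concept, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical contexts for a cluster of countries.
hits = {"capital of X": 4, "visit X": 3}
concepts = {"capital of X": {"country"}, "visit X": {"country", "city"}}
print([c for c, _ in score_nell_concepts(hits, concepts)])  # ['country', 'city']
```

A context matched by many cluster entities but tied to a single concept (like "capital of X" here) contributes the most, which is exactly the generality/specificity balance the slide describes.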