Uploaded On 2017-04-17

WebSets - PPT Presentation

Extracting Sets of Entities from the Web Using Unsupervised Information Extraction. Bhavana Dalvi, William W. Cohen and Jamie Callan, Language Technologies Institute, School of Computer Science. ID: 538703



Presentation Transcript

Slide1

WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction

Bhavana Dalvi, William W. Cohen and Jamie Callan

Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Slide2

Outline

- Motivation
- Previous Work
- Our Approach
- Evaluation
- Applications
- Conclusion

Slide3

Motivation: Acquiring concept-instance pairs

NLP tasks
- Summarization
- Co-reference resolution
- Named entity extraction

Knowledge bases
- E.g. NELL, Freebase
- Limited coverage

Can we use relational data in tables on the web?

Slide4

Previous Work

Co-ordinate terms (same concept): Pittsburgh, New York; USA, India

Hyponym patterns: X "such as" Y, e.g. "City such as Pittsburgh"

Van Durme and Pasca, AAAI'08
- Distributional similarity
- Candidate class-instance pairs: Hearst patterns

Shinzato and Torisawa, NAACL'04
- Uses itemizations in webpages
- Queries the Web to find candidate hypernyms for entities

Table columns
- Hearst patterns and Clueweb09 corpus

Open-domain, unsupervised, corpus based

Slide5

Our Approach

Hypothesis 1: Entities appearing in a table column probably belong to the same concept.

Hypothesis 2: If a set of entities co-occur frequently in multiple table columns coming from multiple URL domains, then with high probability they represent some meaningful concept.

Relational table identification

Entity clustering

Slide6

Relational Table Identification

Heuristic based

Table Classifier

Example relational table:

Country | Capital City
India | Delhi
China | Beijing
Canada | Ottawa
France | Paris

Features:
- # rows
- # columns
- # non-link columns
- length(cells)
- whether the table is recursive

20% tables are relational

65% tables are relational
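A minimal sketch of how the listed features might be computed for one table. The slide does not define them precisely, so the readings here are assumptions: a "non-link column" is taken to be a column whose cells are not all hyperlinks, "length(cells)" is taken as the mean cell text length, and "recursive" is taken to mean that some cell contains a nested table. `Cell` and `table_features` are hypothetical names, not part of WebSets.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    """Hypothetical parsed table cell; field names are illustrative only."""
    text: str
    is_link: bool = False
    has_nested_table: bool = False


def table_features(rows: List[List[Cell]]) -> dict:
    """Compute the slide's five features under the assumptions stated above."""
    n_cols = max((len(r) for r in rows), default=0)
    # Collect each column, tolerating ragged rows.
    columns = [[r[c] for r in rows if c < len(r)] for c in range(n_cols)]
    lengths = [len(cell.text) for r in rows for cell in r]
    return {
        "n_rows": len(rows),
        "n_cols": n_cols,
        # Assumed reading: a column counts as non-link unless every cell is a link.
        "n_non_link_cols": sum(
            1 for col in columns if not all(cell.is_link for cell in col)
        ),
        "mean_cell_length": sum(lengths) / len(lengths) if lengths else 0.0,
        # Assumed reading: recursive means a table nested inside a cell.
        "is_recursive": any(cell.has_nested_table for r in rows for cell in r),
    }


example = [
    [Cell("Country"), Cell("Capital City")],
    [Cell("India"), Cell("Delhi")],
    [Cell("China"), Cell("Beijing")],
    [Cell("Canada"), Cell("Ottawa")],
    [Cell("France"), Cell("Paris")],
]
features = table_features(example)
print(features)
```

A feature dictionary like this could feed either hand-written thresholds or a trained classifier; the slide does not say which thresholds or model were used.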

Slide7

Data Representation

Every table cell is a potential entity

Every table column is a potential entity set

No labeled data, only co-occurrence information

Clustering the table columns is important
- Some columns are only relevant to the page they appear in
- There are many overlapping columns

Slide8

Triplet Store Representation

HTML tables

TableId=21, domain="dom1.com"

Country | Capital City
India | Delhi
China | Beijing
Canada | Ottawa
France | Paris

TableId=34, domain="dom2.com"

Country | Capital City
Canada | Ottawa
China | Beijing
France | Paris
England | London

Entity-feature file: Triplet Store

Entities | Table:Column | Domains
Canada, China, India | 21:1 | dom1.com
Canada, China, France | 21:1, 34:1 | dom1.com, dom2.com
Beijing, Delhi, Ottawa | 21:2 | dom1.com
Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
Canada, England, France | 34:1 | dom2.com
London, Ottawa, Paris | 34:2 | dom2.com
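The triplet store above can be rebuilt with a short sketch. It assumes that one triplet is emitted for every window of three adjacent cells in a column, which the slide does not state. Under that assumption, the six listed triplets are reproduced only if table 34 is read in the order China, Canada, France, England, so the code uses that order as a further assumption. `build_triplet_store` and `TABLES` are hypothetical names.

```python
from collections import defaultdict

# Example tables from the slide, as table_id -> (domain, rows).
# The row order of table 34 is an assumption (see the note above).
TABLES = {
    21: ("dom1.com", [("India", "Delhi"), ("China", "Beijing"),
                      ("Canada", "Ottawa"), ("France", "Paris")]),
    34: ("dom2.com", [("China", "Beijing"), ("Canada", "Ottawa"),
                      ("France", "Paris"), ("England", "London")]),
}


def build_triplet_store(tables):
    """Map each sorted entity triplet to the columns and domains it occurs in."""
    store = defaultdict(lambda: {"columns": set(), "domains": set()})
    for table_id, (domain, rows) in tables.items():
        for col in range(len(rows[0])):
            cells = [row[col] for row in rows]
            # Assumed: one triplet per window of three adjacent cells.
            for i in range(len(cells) - 2):
                key = tuple(sorted(cells[i:i + 3]))
                store[key]["columns"].add(f"{table_id}:{col + 1}")
                store[key]["domains"].add(domain)
    return dict(store)


store = build_triplet_store(TABLES)
for entities, record in sorted(store.items()):
    print(", ".join(entities), sorted(record["columns"]), sorted(record["domains"]))
```

With a sliding window, a column of n cells yields at most n - 2 triplets, so the store grows linearly with the table corpus.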

size(Table corpus) = O(N) ⇒ size(Triplet store) = O(N)

Slide9

Clustering Algorithm

Number of clusters is unknown
- Parametric algorithms not useful

Scalability is important for web-scale corpus
- Naïve single-link clustering algorithm creates the whole similarity matrix: O(N^2)

Single-pass algorithm
- Similar to Chinese Restaurant Process

Complexity: O(N * log N)

 

Slide10

Bottom-up Entity Clustering

Triplets sorted by #domains

Entities | Table:Column | Domains
Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com
Delhi, Beijing, Ottawa | 21:2 | dom1.com
Canada, England, France | 34:1 | dom2.com
London, Ottawa, Paris | 34:2 | dom2.com
India, China, Canada | 21:1 | dom1.com

#clusters = 0

Slide11

Bottom-up Entity Clustering

Triplets sorted by #domains

Entities | Table:Column | Domains
Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com
Delhi, Beijing, Ottawa | 21:2 | dom1.com
Canada, England, France | 34:1 | dom2.com
London, Ottawa, Paris | 34:2 | dom2.com
India, China, Canada | 21:1 | dom1.com

#clusters = 0

Cluster | Entities | Table:Column | Domains
Cluster-1 | Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com

#clusters = 1

Slide12

Bottom-up Entity Clustering

Triplets sorted by #domains

Entities | Table:Column | Domains
Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com
Delhi, Beijing, Ottawa | 21:2 | dom1.com
Canada, England, France | 34:1 | dom2.com
London, Ottawa, Paris | 34:2 | dom2.com
India, China, Canada | 21:1 | dom1.com

#clusters = 1

Cluster | Entities | Table:Column | Domains
Cluster-1 | Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
Cluster-2 | China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com

#clusters = 2

Slide13

Bottom-up Entity Clustering

Triplets sorted by #domains

Entities | Table:Column | Domains
Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com
Delhi, Beijing, Ottawa | 21:2 | dom1.com
Canada, England, France | 34:1 | dom2.com
London, Ottawa, Paris | 34:2 | dom2.com
India, China, Canada | 21:1 | dom1.com

#clusters = 2

Cluster | Entities | Table:Column | Domains
Cluster-1 | Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
Cluster-2 | China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com

Slide14

Bottom-up Entity Clustering

Triplets sorted by #domains

Entities | Table:Column | Domains
Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com
Delhi, Beijing, Ottawa | 21:2 | dom1.com
Canada, England, France | 34:1 | dom2.com
London, Ottawa, Paris | 34:2 | dom2.com
India, China, Canada | 21:1 | dom1.com

#clusters = 2

Cluster | Entities | Table:Column | Domains
Cluster-1 | Beijing, Delhi, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
Cluster-2 | China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com

Slide15

Bottom-up Entity Clustering

Triplets sorted by #domains

Entities | Table:Column | Domains
Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com
Delhi, Beijing, Ottawa | 21:2 | dom1.com
Canada, England, France | 34:1 | dom2.com
London, Ottawa, Paris | 34:2 | dom2.com
India, China, Canada | 21:1 | dom1.com

#clusters = 2

Cluster | Entities | Table:Column | Domains
Cluster-1 | Beijing, Delhi, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
Cluster-2 | China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com

Slide16

Bottom-up Entity Clustering

Triplets sorted by #domains

Entities | Table:Column | Domains
Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com
Delhi, Beijing, Ottawa | 21:2 | dom1.com
Canada, England, France | 34:1 | dom2.com
London, Ottawa, Paris | 34:2 | dom2.com
India, China, Canada | 21:1 | dom1.com

#clusters = 2

Cluster | Entities | Table:Column | Domains
Cluster-1 | Beijing, Delhi, Ottawa, Paris, London | 21:2, 34:2 | dom1.com, dom2.com
Cluster-2 | China, Canada, France, England, India | 21:1, 34:1 | dom1.com, dom2.com
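The walkthrough on the preceding slides can be replayed with a small single-pass sketch. The merge rule is an assumption inferred from the example, not taken from the slides: a triplet joins the first existing cluster with which it shares at least two entities, and otherwise starts a new cluster. `cluster_triplets`, `min_overlap` and the dictionary layout are hypothetical.

```python
def make_triplet(entities, columns, domains):
    """Bundle one triplet-store record as a dict of sets (illustrative layout)."""
    return {"entities": set(entities), "columns": set(columns), "domains": set(domains)}


# The six triplets in the order shown on the slides.
TRIPLETS = [
    make_triplet(["Beijing", "Ottawa", "Paris"], ["21:2", "34:2"], ["dom1.com", "dom2.com"]),
    make_triplet(["China", "Canada", "France"], ["21:1", "34:1"], ["dom1.com", "dom2.com"]),
    make_triplet(["Delhi", "Beijing", "Ottawa"], ["21:2"], ["dom1.com"]),
    make_triplet(["Canada", "England", "France"], ["34:1"], ["dom2.com"]),
    make_triplet(["London", "Ottawa", "Paris"], ["34:2"], ["dom2.com"]),
    make_triplet(["India", "China", "Canada"], ["21:1"], ["dom1.com"]),
]


def cluster_triplets(triplets, min_overlap=2):
    """Single pass over triplets, ordered by descending number of domains."""
    # sorted() is stable, so ties keep their input order.
    ordered = sorted(triplets, key=lambda t: len(t["domains"]), reverse=True)
    clusters = []
    for t in ordered:
        for c in clusters:
            # Assumed merge rule: at least min_overlap shared entities.
            if len(c["entities"] & t["entities"]) >= min_overlap:
                for field in ("entities", "columns", "domains"):
                    c[field] |= t[field]
                break
        else:
            # No cluster matched: this triplet seeds a new cluster.
            clusters.append({field: set(values) for field, values in t.items()})
    return clusters


clusters = cluster_triplets(TRIPLETS)
for i, c in enumerate(clusters, start=1):
    print(f"Cluster-{i}:", sorted(c["entities"]))
```

This sketch scans every existing cluster for each triplet, so it does not by itself achieve the complexity stated in the talk; an index from entities to clusters would be one way to avoid the linear scan.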

Algorithm is order dependent.

Finding optimal ordering is NP-hard.

Sub-optimal solution: order triplets in descending order of #domains.

Slide17

Hypernym Recommendation

Hyponym pattern data:

Entity | Hypernym : count
China | Country : 1000
Canada | Country : 300
Delhi | City : 450, Destination : 10
London | City : 500

Entity clusters:

Entities | Table:Column | Domains
India, China, Canada, France, England | 21:1, 34:1 | dom1.com, dom2.com
Delhi, Beijing, Ottawa, London, Paris | 21:2, 34:2 | dom1.com, dom2.com

Entity clusters with recommended hypernyms:

Hypernym | Entities | Table:Column | Domains
Country : 2 | India, China, Canada, France, England | 21:1, 34:1 | dom1.com, dom2.com
City : 2, Destination : 1 | Delhi, Beijing, Ottawa, London, Paris | 21:2, 34:2 | dom1.com, dom2.com
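The counts in the labeled table (Country : 2; City : 2, Destination : 1) are consistent with scoring each hypernym by the number of cluster entities it co-occurs with. The sketch below implements only that reading; the actual scoring function may also use the pattern counts, which this sketch ignores. `recommend_hypernyms` and `HYPONYM_DATA` are hypothetical names.

```python
from collections import Counter

# Hyponym pattern data from the slide: entity -> {hypernym: pattern count}.
HYPONYM_DATA = {
    "China": {"Country": 1000},
    "Canada": {"Country": 300},
    "Delhi": {"City": 450, "Destination": 10},
    "London": {"City": 500},
}


def recommend_hypernyms(cluster_entities, hyponym_data):
    """Rank hypernyms by how many entities of the cluster they co-occur with."""
    matches = Counter()
    for entity in cluster_entities:
        # Entities without pattern data contribute nothing.
        for hypernym in hyponym_data.get(entity, {}):
            matches[hypernym] += 1
    # most_common() sorts by descending count.
    return matches.most_common()


countries = ["India", "China", "Canada", "France", "England"]
cities = ["Delhi", "Beijing", "Ottawa", "London", "Paris"]
country_labels = recommend_hypernyms(countries, HYPONYM_DATA)
city_labels = recommend_hypernyms(cities, HYPONYM_DATA)
print(country_labels)
print(city_labels)
```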

Score(Hypernym | cluster) = f(co-occurrence counts of hypernym with entities in the cluster)

Slide18

WebSets Framework

HTML documents → Relational Table Identification → Triplet store → Bottom-up Clustering → Entity Clusters → Hypernym Recommendation → Labeled entity sets <Entities, hypernym>

Hyponym Pattern Data is a second input to Hypernym Recommendation.

Slide19

Datasets

Dataset | Description | # HTML pages | # Tables
Toy_Apple | Fruits + companies | 574 | 2.6K
Delicious_Sports | Links from Delicious w/ tag=sports | 21K | 146.3K
Delicious_Music | Links from Delicious w/ tag=music | 183K | 643.3K
CSEAL_Useful | Pages SEAL found NELL entities on | 30K | 322.8K
ASIA_NELL | ASIA run on NELL categories | 112K | 676.9K
ASIA_INT | ASIA run on intelligence domain | 121K | 621.3K
Clueweb_HPR | High pagerank sample of Clueweb | 100K | 586.9K

Slide20

Evaluation: Clustering

Triplet record vs. entity record (Toy_Apple)
- Triplets have an entity disambiguation effect

Precision-recall tradeoff in hypernym recommendation

WebSets vs. K-means

Dataset | Method | K | Purity | NMI | RI
Toy_Apple | K-Means | 40 | 0.96 | 0.71 | 0.98
Toy_Apple | WebSets | 25 | 0.99 | 0.99 | 1.00
Delicious_Sports | K-Means | 50 | 0.72 | 0.68 | 0.98
Delicious_Sports | WebSets | 32 | 0.83 | 0.64 | 1.00
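The slides do not define the metrics in the table. As a reminder of what two of them measure, here is a sketch using the standard textbook definitions of purity and the Rand index (RI); NMI is omitted for brevity. The toy labels are made up for illustration and are not from the Toy_Apple dataset.

```python
from collections import Counter
from itertools import combinations


def purity(pred, gold):
    """Fraction of items that fall in the majority gold class of their cluster."""
    by_cluster = {}
    for p, g in zip(pred, gold):
        by_cluster.setdefault(p, Counter())[g] += 1
    return sum(max(counts.values()) for counts in by_cluster.values()) / len(gold)


def rand_index(pred, gold):
    """Fraction of item pairs on which the clustering and gold labels agree."""
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum((pred[i] == pred[j]) == (gold[i] == gold[j]) for i, j in pairs)
    return agree / len(pairs)


# Made-up example: six items, two predicted clusters, two gold classes.
pred = [0, 0, 0, 1, 1, 1]
gold = ["fruit", "fruit", "company", "company", "company", "company"]
print(round(purity(pred, gold), 3), round(rand_index(pred, gold), 3))
```

Purity rewards many small clusters, which is one reason it is usually reported together with NMI and RI, as in the table above.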

Ideal # clusters: Toy_Apple: 27, Delicious_Sports: 29

Slide21

Evaluation of Entity Sets

Dataset | #Triplets | #Clusters
CSEAL_Useful | 165.2K | 1090
ASIA_NELL | 11.4K | 448
ASIA_INT | 15.1K | 395
Clueweb_HPR | 516.0 | 47

Slide22

Evaluation of Entity Sets

Dataset | #Triplets | #Clusters | #Clusters with hypernyms
CSEAL_Useful | 165.2K | 1090 | 312
ASIA_NELL | 11.4K | 448 | 266
ASIA_INT | 15.1K | 395 | 218
Clueweb_HPR | 516.0 | 47 | 34

Uniformly sample 100 clusters per dataset

Manual evaluation: cluster, ranked list of hypernyms

Slide23

Evaluation of Entity Sets

Dataset | #Triplets | #Clusters | #Clusters with hypernyms | %Meaningful
CSEAL_Useful | 165.2K | 1090 | 312 | 69.0
ASIA_NELL | 11.4K | 448 | 266 | 73.0
ASIA_INT | 15.1K | 395 | 218 | 63.0
Clueweb_HPR | 516.0 | 47 | 34 | 70.5

Decision: whether a cluster is meaningful

Most specific hypernym from the list

Noisy clusters: e.g. navigational / menu items

Slide24

Evaluation of Entity Sets

Dataset | #Triplets | #Clusters | #Clusters with hypernyms | %Meaningful | MRR of hypernym
CSEAL_Useful | 165.2K | 1090 | 312 | 69.0 | 0.56
ASIA_NELL | 11.4K | 448 | 266 | 73.0 | 0.59
ASIA_INT | 15.1K | 395 | 218 | 63.0 | 0.58
Clueweb_HPR | 516.0 | 47 | 34 | 70.5 | 0.56

Chose most specific label from the ranked list of hypernyms.

MRR: computed against ranked list of hypernyms per cluster

Slide25

Evaluation of Entity Sets

Dataset | #Triplets | #Clusters | #Clusters with hypernyms | %Meaningful | MRR of hypernym | Precision
CSEAL_Useful | 165.2K | 1090 | 312 | 69.0 | 0.56 | 98.6%
ASIA_NELL | 11.4K | 448 | 266 | 73.0 | 0.59 | 98.5%
ASIA_INT | 15.1K | 395 | 218 | 63.0 | 0.58 | 97.4%
Clueweb_HPR | 516.0 | 47 | 34 | 70.5 | 0.56 | 99.0%

Precision(cluster, hypernym) = (#entities belonging to hypernym) / size(cluster)

Amazon Mechanical Turk evaluation to compute precision
- X: entity, Y: hypernym → "Is X of type Y?"

Slide26

Application: Corpus Summary

Music Domain

Instruments: Flute, Tuba, String Orchestra, Chimes, Harmonium, Bassoon, Woodwinds, Glockenspiel, French horn, Timpani, Piano

Intervals: Whole tone, Major sixth, Fifth, Perfect fifth, Seventh, Third, Diminished fifth, Whole step, Fourth, Minor seventh, Major third, Minor third

Genres: Smooth jazz, Gothic, Metal rock, Rock, Pop, Hip hop, Rock n roll, Country, Folk, Punk rock

Audio Equipments: Audio editor, General midi synthesizer, Audio recorder, Multichannel digital audio workstation, Drum sequencer, Mixers, Music engraving system, Audio server, Mastering software, Soundfont sample player

Concepts very specific to domain.

Hypernyms are automatically generated.

Slide27

Conclusions & Future Work

We proposed an unsupervised information extraction method to extract sets of entities from the HTML tables on the web.

Efficient clustering algorithm: O(N * log N)
- High cluster purity (0.83-0.99)

Unsupervised hypernym recommendation
- Intra-cluster precision (97-99%)

Future work: extend this technique for unsupervised extraction and naming of relations.

Slide28

Thank You!

Slide29

Evaluation: Hypernym Recommendation

Method | K | J | %Accuracy | Yield (#pairs produced) | #Correct pairs (predicted)
DPM | Inf | 0.0 | 34.6 | 88.6K | 30.7K
DPM | 5 | 0.2 | 50.0 | 0.8K | 0.4K
DPMExt | Inf | 0.0 | 21.9 | 100,828.0K | 22,081.3K
DPMExt | 5 | 0.2 | 44.0 | 2.8K | 1.2K
WS | - | - | 67.7 | 73.7K | 45.8K
WSExt | - | - | 78.8 | 64.8K | 51.1K

WebSets (WS) vs. Van Durme and Pasca method (DPM), AAAI'08

DPM: Outputs a concept-instance pair if it has appeared in the Hyponym Concept dataset (HCD) and
- #clusters the concept is assigned to < K
- fraction of entities in the cluster that match with the concept > J

DPMExt: Outputs a concept-instance pair even if absent in HCD.

WS: Ranks the hypernyms by #entities it matches with

WSExt: Filters Hyponym Concept dataset by threshold on count

Slide30

More stats about various datasets

Dataset | #Tables | %Relational | #Filtered tables | %Relational filtered | #Triplets
Toy_Apple | 2.6K | 50 | 762 | 75 | 15K
Delicious_Sports | 146.3K | 15 | 57.0K | 55 | 63K
Delicious_Music | 643.3K | 20 | 201.8K | 75 | 93K
CSEAL_Useful | 322.8K | 30 | 116.2K | 80 | 1148K
ASIA_NELL | 676.9K | 20 | 233.0K | 55 | 421K
ASIA_INT | 621.3K | 15 | 216.0K | 60 | 374K
Clueweb_HPR | 586.9K | 10 | 176.0K | 35 | 78K

Slide31

Triplet record vs. entity record (on Toy_Apple dataset)

Method | K | FM w/ Entity records | FM w/ Triplet records
WebSets | - | 0.11 (K=25) | 0.85 (K=34)
K-Means | 30 | 0.09 | 0.35
K-Means | 25 | 0.08 | 0.38

Triplet records can separate "Apple as fruit" and "Apple as company".

Slide32

Application: Contributions to existing NELL KB

Shard | #classes | #pairs | %acc (1-20) | %acc (1-100) | %acc (1-500) | %acc (1-1000) | %acc (1-10000)
100K | 13 | 944 | 100.00 | 80.56 | 81.82 | 72.97 | 65.93
200K | 3 | 542 | 100.00 | 86.49 | 77.78 | 75.00 | 76.40
300K | 24 | 2236 | 100.00 | 89.74 | 79.31 | 72.72 | 67.71
400K | 44 | 4456 | 93.75 | 80.65 | 80.85 | 74.63 | 67.86
500K | 34 | 9194 | 100.00 | 83.78 | 85.96 | 77.03 | 73.40

Entity sets produced by WebSets are mapped to NELL categories
- entities in each cluster → contexts in np-context data from Clueweb
- context → NELL categories (using useful patterns for each NELL category)
- score(NELL concept) = sum(TFIDF scores of contexts)
- TFIDF score(context) depends on #entities in the cluster it matches with (generality) and #concepts it is associated with (specificity)
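The slide gives only the shape of the scoring, not the formula. The sketch below is one possible instantiation under loud assumptions: the generality term is the raw number of cluster entities a context matches, and the specificity term is log(total concepts / concepts the context is associated with). The contexts, concepts and the name `score_concepts` are made up for illustration and do not come from NELL.

```python
import math
from collections import defaultdict


def score_concepts(context_matches, context_concepts, n_concepts):
    """Sum an assumed TF-IDF-style weight of each context into its concepts.

    context_matches: context -> number of cluster entities it matches
    context_concepts: context -> set of concepts the context is a pattern for
    """
    scores = defaultdict(float)
    for context, n_entities in context_matches.items():
        concepts = context_concepts.get(context, set())
        if not concepts:
            continue  # context is not a useful pattern for any concept
        # Assumed weighting: generality * specificity.
        weight = n_entities * math.log(n_concepts / len(concepts))
        for concept in concepts:
            scores[concept] += weight
    return dict(scores)


# Hypothetical contexts for a cluster of city names.
context_matches = {"mayor of _": 4, "visit _": 5}
context_concepts = {"mayor of _": {"city"}, "visit _": {"city", "country"}}
scores = score_concepts(context_matches, context_concepts, n_concepts=2)
print(scores)
```

In this toy case the context shared by both concepts gets zero weight, so only the city-specific context contributes and "city" outranks "country".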
