WebSets: Extracting Sets of Entities from the Web
Author : briana-ranney | Published Date : 2025-05-12
Description: WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction. Bhavana Dalvi, William W. Cohen and Jamie Callan, Language Technologies Institute, Carnegie Mellon University.
Transcript: WebSets: Extracting Sets of Entities from the Web
WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, Carnegie Mellon University

Poster sections: Motivation, Experiments, WebSets Framework, Application, Acknowledgements, Conclusions

Acknowledgements: This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

Motivation: Many NLP tasks benefit from concept-instance pairs, for example summarization, co-reference resolution and named entity extraction. Existing knowledge bases (NELL, Freebase, …) are incomplete. The problem can be divided into two parts: (1) detecting coordinate terms to find term clusters (i ~ j), and (2) using hyponym patterns (“X such as Y”) to name the terms. We worked on the problem of automatically harvesting concept-instance pairs from a corpus of HTML tables.

Hypothesis 1: Entities appearing in a table column probably belong to the same concept.

Hypothesis 2: Frequent co-occurrence of a set of entities in multiple table columns and distinct web domains indicates that they represent some meaningful concept.

Conclusions: We propose an unsupervised IE technique to extract concept-instance pairs from an HTML corpus. It is novel in that it relies solely on HTML tables to detect coordinate terms. Our triplet-based data representation helps in disambiguating multiple senses of the same noun phrase. The WebSets approach is corpus-driven, efficient and scalable. We presented a method that takes O(N log N) time to process HTML tables of size O(N) and extract named entity sets from them. Labeled entity sets produced by WebSets can act as a summary of an HTML corpus. Class-instance pairs thus produced are also being used to populate an existing knowledge base (NELL). A future research direction is to extend this method to unsupervised relation extraction.
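The two hypotheses and the triplet-based representation mentioned above can be illustrated with a small sketch. This is not the authors' implementation. It assumes that a triplet record is formed from three consecutive cells of a table column, that entities are lowercased and sorted to form the record key, and that a record is kept when it is seen in at least `min_domains` distinct web domains. The function names, the `min_domains` parameter and the sample tables are all hypothetical.

```python
from collections import defaultdict


def build_triplet_store(columns):
    """Group table cells into triplet records.

    columns: iterable of (table_id, column_index, domain, entities),
    where entities is the list of cell strings in one table column.
    Returns a dict mapping an entity triplet to the table columns and
    web domains in which it was seen.
    """
    store = defaultdict(lambda: {"columns": set(), "domains": set()})
    for table_id, col_idx, domain, entities in columns:
        # Slide a window of three consecutive cells down the column.
        for i in range(len(entities) - 2):
            window = entities[i:i + 3]
            # Assumption: lowercase and sort so that the key does not
            # depend on row order or capitalization.
            key = tuple(sorted(e.strip().lower() for e in window))
            store[key]["columns"].add((table_id, col_idx))
            store[key]["domains"].add(domain)
    return store


def frequent_triplets(store, min_domains=2):
    """Keep triplets seen in at least min_domains distinct domains."""
    return {k: v for k, v in store.items() if len(v["domains"]) >= min_domains}


# Made-up sample data, for illustration only.
sample_columns = [
    (21, 1, "wikipedia.org", ["India", "China", "Canada", "France"]),
    (34, 0, "aneki.com", ["China", "Canada", "India"]),
]
store = build_triplet_store(sample_columns)
frequent = frequent_triplets(store, min_domains=2)
```

Here only the triplet (canada, china, india) appears in columns from two distinct domains, so it is the only record that survives the filter. Keeping the set of table columns alongside each triplet is what would let a later clustering step tell apart two senses of the same noun phrase, since each sense tends to co-occur with different neighbours in different columns.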
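The hyponym pattern “X such as Y” named above can be sketched with a regular expression. This is a deliberately simplified illustration, not the patterns used in the work itself: it assumes the concept is the single word before "such as", that instances are separated by commas, "and" or "or", and that the instance list ends at a period or semicolon. The names `SUCH_AS`, `SEPARATORS` and `extract_pairs` are hypothetical.

```python
import re

# Assumption: the concept is the single word immediately before "such as",
# and the instance list runs up to the next period or semicolon.
SUCH_AS = re.compile(r"\b(?P<concept>\w+) such as (?P<instances>[^.;]+)")

# Separators between instances: ", and ", " or ", ", " and similar.
SEPARATORS = re.compile(r",\s*(?:and|or)\s+|\s+(?:and|or)\s+|,\s*")


def extract_pairs(sentence):
    """Return (concept, instance) pairs found by the 'X such as Y' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        concept = match.group("concept").lower()
        for instance in SEPARATORS.split(match.group("instances")):
            instance = instance.strip()
            if instance:
                pairs.append((concept, instance))
    return pairs


pairs = extract_pairs("He visited countries such as India, China and Canada.")
# pairs == [("countries", "India"), ("countries", "China"), ("countries", "Canada")]
```

A real system would use noun-phrase chunking rather than single words, but the sketch shows how one matched sentence can supply a candidate concept name for a whole cluster of coordinate terms.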
Example table sources: TableId=21, domain=“wikipedia.org”; TableId=34, domain=“aneki.com”
Evaluation of quality of entity sets produced
Hyponym Concept Dataset
Corpus Summary:
Hearst patterns, e.g. “X such as Y”:
    arg1 such as (w+ (and/or))? arg2
    arg1 (w+ )? (and/or) other arg2
    arg1 include (w+ (and/or))? arg2
    arg1 including (w+ (and/or))? arg2
ClueWeb09 dataset: 500M page sample of the Web
Noun-pair context dataset, e.g. “Obama is president of USA” gives (president of, Obama, USA)
X, Y are hyponym, hypernym when context = Hearst pattern
Bottom-Up Clustering Algorithm
Record/cluster:
Clusters = { }
Go through each triplet record t so that |t.domains| > threshold
For each existing cluster C check if t.entity overlaps with C.entity OR t.tableColumn overlaps with