Can we automatically extract this information from the text, instead of depending on creators to provide automated annotations?

Information Extraction
What is "Information Extraction"

Filling slots in a database from sub-segments of text. As a task:

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

Slots to fill: NAME, TITLE, ORGANIZATION

Slides from Cohen & McCallum
What is "Information Extraction"

The same passage with the slots filled:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

Slides from Cohen & McCallum
Tapping into the Collective Unconscious
Another thread of exciting research is driven by the realization that the Web is not random at all: it is written by humans, so analyzing its structure and content allows us to tap into the collective unconscious. Meaning can emerge from syntactic notions such as "co-occurrences" and "connectedness". Examples:
- Analyzing term co-occurrences in web-scale corpora to capture semantic information (today's paper)
- Analyzing the link structure of the web graph to discover communities (DoD and NSA are very much into this as a way of breaking terrorist cells)
- Analyzing the transaction patterns of customers (collaborative filtering)

Big Idea 3: How can we possibly do this without full NLP?

"(Un)wrapping the wrapped results.."
Fielded IE Systems: Citeseer, Google Scholar; Libra
How do they do it? Why do they fail?
IE in Context
Spider the Web to build a document collection; filter by relevance; IE (segment, classify, associate, cluster) fills a database; the database is then loaded, queried/searched, and data mined. Supporting steps: create an ontology, label training data, train extraction models.

Slides from Cohen & McCallum
What is "Information Extraction"

As a family of techniques: Information Extraction = segmentation + classification + association + clustering.

Applied to the passage above, the segmented and classified strings (Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Bill Veghte, VP, Richard Stallman, founder, Free Software Foundation) are associated and clustered into records:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

Slides from Cohen & McCallum
IE History
Pre-Web
- Mostly news articles
- De Jong's FRUMP [1982]: hand-built system to fill Schank-style "scripts" from news wire
- Message Understanding Conference (MUC): DARPA ['87-'95], TIPSTER ['92-'96]
- Most early work dominated by hand-built models, e.g. SRI's FASTUS, hand-built FSMs
- But by the 1990's, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]

Web
- AAAI '94 Spring Symposium on "Software Agents": much discussion of ML applied to the Web (Maes, Mitchell, Etzioni)
- Tom Mitchell's WebKB, '96: build KBs from the Web
- Wrapper induction: first by hand, then ML: [Doorenbos '96], [Soderland '96], [Kushmerick '97], ...

Slides from Cohen & McCallum
Information Extraction vs. NLP?

Information extraction attempts to find some of the structure and meaning in (hopefully) template-driven web pages. As IE becomes more ambitious and the text becomes more free-form, IE ultimately becomes equal to NLP. The Web does give NLP one particular boost: massive corpora.
MUC

DARPA funded significant efforts in IE in the early to mid 1990's. The Message Understanding Conference (MUC) was an annual event/competition where results were presented. It focused on extracting information from news articles:
- Terrorist events
- Industrial joint ventures
- Company management changes

Information extraction is of particular interest to the intelligence community (CIA, NSA).
What makes IE from the Web Different?

Less grammar, but more formatting & linking. The directory structure, link structure, formatting & layout of the Web is its own new grammar. Compare the newswire version of an announcement (below) with its Web version (www.apple.com/retail, www.apple.com/retail/soho, www.apple.com/retail/soho/theatre.html):

Apple to Open Its First Retail Store in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience. "Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

Slides from Cohen & McCallum
Landscape of IE Tasks (1/4): Pattern Feature Domain

- Text paragraphs without formatting
- Grammatical sentences and some formatting & links
- Non-grammatical snippets, rich formatting & links
- Tables

Example of a plain text paragraph: "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."

Slides from Cohen & McCallum
Landscape of IE Tasks (2/4): Pattern Scope

- Web site specific (formatting), e.g. Amazon.com book pages
- Genre specific (layout), e.g. resumes
- Wide, non-specific (language), e.g. university names

Slides from Cohen & McCallum
Landscape of IE Tasks (3/4): Pattern Complexity

E.g. word patterns:
- Closed set (U.S. states): "He was born in Alabama..."; "The big Wyoming sky..."
- Regular set (U.S. phone numbers): "Phone: (413) 545-1323"; "The CALD main office can be reached at 412-268-1299"
- Complex pattern (U.S. postal addresses): "University of Arkansas, P.O. Box 140, Hope, AR 71802"; "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"
- Ambiguous patterns, needing context and many sources of evidence (person names): "...was among the six houses sold by Hope Feldman that year."; "Pawel Opalinski, Software Engineer at WhizBang Labs."

Slides from Cohen & McCallum
Landscape of IE Tasks (4/4): Pattern Combinations

Example text: "Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt."

- Single entity ("named entity" extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
- Binary relationship: Relation: Person-Title (Person: Jack Welch, Title: CEO); Relation: Company-Location (Company: General Electric, Location: Connecticut)
- N-ary record: Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt)

Slides from Cohen & McCallum
Evaluation of Single Entity Extraction

Example sentence: "Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke."
TRUTH contains 4 name segments; PRED contains 6 predicted segments, of which 2 exactly match true segments.

Precision = (# correctly predicted segments) / (# predicted segments) = 2/6
Recall    = (# correctly predicted segments) / (# true segments)      = 2/4
F1 = harmonic mean of Precision & Recall = 1 / ( ((1/P) + (1/R)) / 2 )

Slides from Cohen & McCallum
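To make the scoring concrete, here is a minimal sketch of segment-level precision/recall/F1 in Python. The (start, end) offsets in the example are invented to reproduce the slide's counts (4 true segments, 6 predicted, 2 correct); they are not taken from the slide itself.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) token offsets of an extracted field

def prf1(true_segs: List[Segment], pred_segs: List[Segment]) -> Tuple[float, float, float]:
    """Segment-level precision, recall, and F1: a prediction counts as
    correct only if it exactly matches a true segment's boundaries."""
    correct = len(set(true_segs) & set(pred_segs))
    p = correct / len(pred_segs) if pred_segs else 0.0
    r = correct / len(true_segs) if true_segs else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# Hypothetical offsets giving 4 true segments, 6 predicted, 2 exactly right.
truth = [(0, 2), (3, 5), (11, 14), (15, 17)]
pred  = [(0, 2), (3, 4), (4, 6), (7, 8), (11, 14), (16, 17)]
print(prf1(truth, pred))  # -> (0.333..., 0.5, 0.4)
```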
State of the Art Performance

- Named entity recognition (Person, Location, Organization, ...): F1 in the high 80's or low- to mid-90's
- Binary relation extraction (Contained-in(Location1, Location2), Member-of(Person1, Organization1)): F1 in the 60's, 70's, or 80's
- Wrapper induction: extremely accurate performance obtainable, but human effort (~30 min) required on each site

Slides from Cohen & McCallum
Landscape of IE Techniques (1/1): Models

Any of these models can be used to capture words, formatting, or both. Running example: "Abraham Lincoln was born in Kentucky."

- Lexicons: is the candidate string a member of a list (Alabama, Alaska, ..., Wisconsin, Wyoming)?
- Classify pre-segmented candidates: given a candidate segment, which class?
- Sliding window: classify each window, trying alternate window sizes
- Boundary models: classify positions as BEGIN or END of a field
- Finite state machines: most likely state sequence?
- Context free grammars: most likely parse?

Slides from Cohen & McCallum
Three Examples

1. (Un)wrappers, which use path expressions on DOM trees
2. Pattern extractors, which use path expressions on parse trees
3. Context-based slot fillers, which annotate words into an ontology with the help of the context surrounding them
More Ambitious (Blue Sky) Approaches

The Semantic Web needs tagged data and background knowledge; blue-sky approaches try to automate both:
- Knowledge extraction: extract base-level knowledge ("facts") directly from the web
- Automated tagging: start with a background ontology and tag other web pages (SemTag/Seeker)

The information extraction tasks in fielded applications like Citeseer/Libra are narrowly focused: we assume that we are learning specific relations (e.g. author/title) and that the extracted relations will be put in a database for DB-style look-up. Let's look at the state of the feasible art before going to blue sky.
Extraction from Templated Text

Many web pages are generated automatically from an underlying database, so the HTML structure of the pages is fairly specific and regular (semi-structured). The output is intended for human consumption, however, not machine interpretation. An IE system for such generated pages allows the web site to be viewed as a structured database. An extractor for a semi-structured web site is sometimes referred to as a wrapper, and the process of extracting from such pages as screen scraping.
Templated Extraction using DOM Trees

Web extraction may be aided by first parsing web pages into DOM trees. Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract. Regex patterns may still be needed to identify the proper portion of the final CharacterData node.
Sample DOM Tree Extraction

Example page: a book listing whose title ("Age of Spiritual Machines") sits in a B element and whose author ("Ray Kurzweil") sits in an A element under a FONT element.

Title path:  HTML -> BODY -> B -> CharacterData
Author path: HTML -> BODY -> FONT -> A -> CharacterData

Building the wrapper can be "semi-automated": users show examples and the program remembers the path expressions. Wrapper maintenance? Cheap labor...
Wrappers were the basis for many startups, like Junglee, Flipdog, etc. If there is cooperation from the source, an API can be established, removing the need for wrappers.
Three Examples

1. (Un)wrappers, which use path expressions on DOM trees
2. Pattern extractors, which use path expressions on parse trees
3. Context-based slot fillers, which annotate words into an ontology with the help of the context surrounding them
Extraction from Free Text involves Natural Language Processing

If extracting from automatically generated web pages, simple regex patterns usually work. If extracting from more natural, unstructured, human-written text, some NLP may help:
- Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, etc.
- Syntactic parsing: identify phrases (NP, VP, PP)
- Semantic word categories (e.g. from WordNet): KILL: kill, murder, assassinate, strangle, suffocate

Off-the-shelf software is available to do this, e.g. the Brill tagger. Extraction patterns can then use POS or phrase tags, in analogy to regex patterns on DOM trees for structured text.
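As a concrete (and hedged) illustration of the preprocessing step, the sketch below uses NLTK's tokenizer and POS tagger, one readily available off-the-shelf option; the slide itself mentions the Brill tagger, and the crude proper-noun rule here is only a stand-in for a real extraction pattern.

```python
# POS tagging plus a toy POS-based extraction pattern. Requires the NLTK
# 'punkt' and 'averaged_perceptron_tagger' data packages to be downloaded.
import nltk

sentence = "Richard Stallman, founder of the Free Software Foundation, countered."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)          # [('Richard', 'NNP'), ('Stallman', 'NNP'), ...]

# A crude extraction pattern over POS tags: runs of proper nouns (NNP) become
# candidate names; a real system would add phrase chunking and context checks.
names, current = [], []
for word, tag in tagged:
    if tag == "NNP":
        current.append(word)
    elif current:
        names.append(" ".join(current))
        current = []
if current:
    names.append(" ".join(current))

print(names)  # e.g. ['Richard Stallman', 'Free Software Foundation'] (exact output depends on the tagger)
```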
I. Generate-and-Test Architecture

Generic extraction patterns (Hearst '92): "...cities such as Boston, Los Angeles, and Seattle..."
Rule: ("C such as NP1, NP2, and NP3") => IS-A(each(head(NPi)), C)

This is template-driven extraction, where the template is stated in terms of the syntax tree. The generated candidates can be noisy:
- "Detailed information for several countries such as maps, ..." (hence the check ProperNoun(head(NP)))
- "I listen to pretty much all music but prefer country such as Garth Brooks"
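A minimal sketch of a Hearst-style "such as" extractor. The slide's version matches over syntax trees; this sketch uses a plain regular expression over raw text as a simplified illustration, so it suffers exactly the noise problems listed above.

```python
# Generate candidate IS-A facts from "C such as NP1, NP2, and NP3" contexts.
import re

PATTERN = re.compile(
    r"(?P<cls>\w+)\s+such\s+as\s+"
    r"(?P<nps>[A-Z][\w ]*(?:,\s*[A-Z][\w ]*)*(?:,?\s+and\s+[A-Z][\w ]*)?)"
)

def hearst_candidates(text):
    """Yield candidate (instance, class) pairs, i.e. IS-A(instance, class)."""
    for m in PATTERN.finditer(text):
        cls = m.group("cls")
        for np in re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group("nps")):
            np = np.strip()
            if np:
                yield (np, cls)

text = "...cities such as Boston, Los Angeles, and Seattle..."
print(list(hearst_candidates(text)))
# [('Boston', 'cities'), ('Los Angeles', 'cities'), ('Seattle', 'cities')]
```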
Assessing the fact accuracy

Assess candidate extractions using pointwise mutual information over hit counts (PMI-IR, Turney '01):

PMI(Seattle, City)   = 24.7M / 107M ~ 23%
PMI(Seattle, Tomato) =  1.5M / 107M ~  1%

Seattle is about 20 times more likely to be a city than a tomato! (Recall "water flows upwards".)
..but many things indicate "city"-ness

PMI = frequency of co-occurrence of the instance I and a discriminator D. Use 5-50 discriminators Di; each PMI for Di is a feature fi, and the features are combined with Naive Bayes:
- PMI is used for feature selection
- NBC is used for learning
- Hit counts are used for assessing PMI as well as the conditional probabilities

Discriminator phrases fi: "x is a city", "x has a population of", "x is the capital of y", "x's baseball team..."

Keep the probabilities with the extracted facts.
Assessment In Action

I = "Yakima" (1,340,000 hits); D = <class name>
I + D = "Yakima city" (2,760 hits); PMI = 2,760 / 1.34M ~ 0.002

I = "Avocado" (1,000,000 hits)
I + D = "Avocado city" (10 hits); PMI = 0.00001 << 0.002
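A minimal sketch of the PMI assessment, following the slide's arithmetic; the hit counts are hard-coded from the example, whereas a real system such as KnowItAll would obtain them from a search-engine API.

```python
# PMI-based assessment of a candidate fact from hit counts.
def pmi(instance_hits: int, instance_plus_discriminator_hits: int) -> float:
    """PMI score used on the slide: co-occurrence count over instance count."""
    return instance_plus_discriminator_hits / instance_hits

# Discriminator D = '<x> city'
print(pmi(1_340_000, 2_760))   # Yakima:  ~0.002  -> plausibly a city
print(pmi(1_000_000, 10))      # Avocado: 0.00001 -> not a city

# In the full system, 5-50 such discriminator PMIs become features f_i and are
# combined with a Naive Bayes classifier rather than thresholded one at a time.
```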
Some Sources of Ambiguity

- Time: "Clinton is the president" (in 1996)
- Context: "common misconceptions.."
- Opinion: Elvis...
- Multiple word senses: Amazon, Chicago, Chevy Chase, etc. Dominant senses can mask recessive ones! Approach: unmasking, e.g. query 'Chicago -City'
Chicago Unmasked

[Plots comparing hit distributions for "Chicago" in its City sense and its Movie sense, before and after unmasking.]
Impact of Unmasking on PMI

Name         Recessive sense   Original   Unmasked   Boost
Washington   city              0.50       0.99        96%
Casablanca   city              0.41       0.93       127%
Chevy Chase  actor             0.09       0.58       512%
Chicago      movie             0.02       0.21       972%
CBioC: Collaborative Bio-Curation

Motivation: to extract the information nuggets of articles and abstracts and store them in a database. The challenge is that the number of articles is huge and keeps growing, and extracting from them requires processing natural language. The two existing approaches, human curation and automatic information extraction systems, are not able to meet the challenge: the first is expensive, while the second is error-prone.
CBioC (cont'd)

Approach: we propose a solution that is inexpensive and that scales up. It takes advantage of automatic information extraction methods as a starting point, based on the premise that if there are a lot of articles, then there must be a lot of readers and authors of these articles. We provide a mechanism by which the readers of the articles can participate and collaborate in the curation of information. We refer to this approach as "Collaborative Curation".
Using the C-BioCurator System (cont'd)
What is the main difference between KnowItAll and CBioC?

Assessment: KnowItAll does it via search-engine hit counts (PMI); CBioC does it by voting.
Three Examples

1. (Un)wrappers, which use path expressions on DOM trees
2. Pattern extractors, which use path expressions on parse trees
3. Context-based slot fillers, which annotate words into an ontology with the help of the context surrounding them
Annotate base facts, given text and an ontology.
Annotation

"The Chicago Bulls announced yesterday that Michael Jordan will. . ."

The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan,_Michael">Michael Jordan</resource> will...
Semantic Annotation

(Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt)

The simplest meta-data extraction task over natural language is to establish a "type" relation between entities in the NL resources and concepts in ontologies: named entity identification.
Semantic Annotation

The content of an annotation consists of rich semantic information, targeted not only at human readers of resources but also at software agents.
- formal: metadata following structural standards; informal: personal notes written in the margin while reading an article
- explicit: carries sufficient information for interpretation; tacit: many personal annotations (telegraphic and incomplete)

(http://www-scf.usc.edu/~csci586/slides/6)
Uses of Annotation

(http://www-scf.usc.edu/~csci586/slides/8)
Objectives of Annotation

Generate metadata for existing information, e.g. author tags in HTML, RDF descriptions added to HTML, content descriptions for multimedia files. Employ the metadata for improved search, navigation, presentation, and summarization of contents.

(http://www.aifb.uni-karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf)
Annotation

The current practice of annotation for knowledge identification and extraction is time consuming, needs annotation by experts, and is complex. Goal: reduce the burden of text annotation for Knowledge Management.

(www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt)
SemTag & Seeker (WWW-03 Best Paper Prize)

Seeded with the TAP ontology (72k concepts) and ~700 human judgments; crawled 264 million web pages; extracted 434 million semantic tags, automatically disambiguated.
SemTag

A research project at IBM. Very large scale, the largest to date: 264 million web pages. Goal: to provide an early set of widespread semantic tags through automated generation.
SemTag

Uses a broad, shallow knowledge base: TAP, which holds lexical and taxonomic information about popular objects (music, movies, sports, etc.).
SemTag

Problem: no write access to the original document, so how do you annotate? Solution: store annotations in a web-available database.
SemTag: Semantic Label Bureau

A separate store of semantic annotation information: an HTTP server that can be queried for annotation information. Example queries: find all semantic tags for a given document; find all semantic tags for a particular object.
SemTag: Methodology
SemTag: Three Phases

- Spotting pass: tokenize the document; keep all instances of taxonomy labels plus a 20-word window
- Learning pass: find the corpus-wide distribution of terms at each internal node of the taxonomy, based on a representative sample
- Tagging pass: scan the windows to disambiguate each reference, which is finally determined to be a TAP object
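A minimal sketch of the spotting pass, under simplifying assumptions: a toy label set and whitespace tokenization stand in for the full TAP lexicon and the real tokenizer.

```python
# Spotting pass: for every occurrence of a taxonomy label in the token stream,
# keep the label plus roughly 20 words of surrounding context.
from typing import List, Tuple

def spot(tokens: List[str], labels: set, window: int = 10) -> List[Tuple[str, List[str]]]:
    """Return (label, context) pairs; context is ~20 words centered on the spot."""
    spots = []
    for i, tok in enumerate(tokens):
        if tok in labels:
            context = tokens[max(0, i - window): i + window + 1]
            spots.append((tok, context))
    return spots

tokens = "the jaguar prowled past the stadium where the jaguar team was warming up".split()
print(spot(tokens, labels={"jaguar"}))
```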
SemTag

Another problem, magnified by the scale: ambiguity resolution. Two fundamental categories of ambiguity:
- Some labels appear at multiple locations in the taxonomy
- Some entities have labels that occur in contexts that have no representative in the taxonomy
SemTag

Solution: Taxonomy Based Disambiguation (TBD). The TBD expectation: human-tuned parameters are used in small, critical sections, while automated approaches deal with the bulk of the information.
SemTag: TBD Methodology

- Each node in the taxonomy is associated with a set of labels (Cats, Football, and Cars all contain "jaguar")
- Each label found in the text is stored with a window of 20 words: the context
- Each node has an associated similarity function mapping a context to a similarity; higher similarity means the context is more likely to contain a reference to that node
SemTag: Similarity

Built a 200,000-word lexicon (the 200,100 most common words minus the 100 most common), giving a 200,000-dimensional vector space. Training data are spots (label, context) with the correct node, from which the distribution of terms for each node is estimated. Similarity is standard cosine similarity between TFIDF vectors (context vs. node).
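A minimal sketch of the context-vs-node scoring just described: TFIDF vectors over a fixed lexicon compared with cosine similarity. The lexicon, node profile, and context below are toy stand-ins for SemTag's 200,000-word space.

```python
# TFIDF + cosine similarity between a spot's context and a taxonomy node's
# term profile; higher similarity -> the context more likely refers to that node.
import math
from collections import Counter
from typing import Dict, List

def tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

idf = {"engine": 1.2, "stadium": 1.5, "team": 1.0, "leather": 1.3, "coupe": 2.0}
car_node_profile = tfidf("engine coupe leather engine".split(), idf)
context = tfidf("the jaguar coupe has a powerful engine".split(), idf)
print(cosine(context, car_node_profile))  # high similarity -> likely the Car node
```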
SemTag

References inside the taxonomy vs. references outside the taxonomy (multiple nodes: b = r, b != p(v)). The core question: is a context c appropriate for a node v?
SemTag

Some internal nodes are very popular, so the system also associates with each node a measurement of how accurate Sim is likely to be there, and of how ambiguous the node is overall (consistency of human judgment). The TBD algorithm returns 1 or 0 to indicate whether a particular context c is on topic for a node v. Accuracy: 82% on 434 million spots.
Summary
Information extraction can be motivated either as explicating more structure from the data or as an automated route to the Semantic Web. Extraction complexity depends on whether the text is "templated" or free-form:
- Extraction from templated text can be done with regular expressions (and DOM paths)
- Extraction from free-form text requires NLP, e.g. in terms of part-of-speech tagging

"Annotation" involves connecting terms in free-form text to items in the background knowledge; it too can be automated.
Sliding Windows
Slides from Cohen & McCallum

Landscape: Focus of this Tutorial

Dimensions: pattern complexity (closed set, regular, complex, ambiguous); pattern feature domain (words, words + formatting, formatting); pattern scope (site-specific, genre-specific, general); pattern combinations (entity, binary, n-ary); models (lexicon, regex, window, boundary, FSM, CFG).

Slides from Cohen & McCallum
Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science, Carnegie Mellon University
3:30 pm, 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

(CMU UseNet seminar announcement; a fixed-size window slides across the text, and each window is classified as containing a target field or not.)

Slides from Cohen & McCallum
A "Naive Bayes" Sliding Window Model [Freitag 1997]

Example: "... 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun ..."
A candidate window w_t ... w_{t+n} has a prefix (w_{t-m} ... w_{t-1}), contents, and suffix (w_{t+n+1} ... w_{t+n+m}).

- Estimate Pr(LOCATION | window) using Bayes rule
- Try all "reasonable" windows (vary length, position)
- Assume independence for length, prefix words, suffix words, content words
- Estimate from data quantities like Pr("Place" in prefix | LOCATION)

If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.

Other examples of sliding windows: [Baluja et al 2000] (decision tree over individual words and their context).

Slides from Cohen & McCallum
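A minimal sketch of the Naive Bayes window scoring under the independence assumptions listed above. The probability tables are tiny invented numbers purely for illustration; Freitag's system estimates them from labeled seminar announcements.

```python
# Score a candidate window for the LOCATION field as
# Pr(length) * prod Pr(prefix word) * prod Pr(content word) * prod Pr(suffix word),
# worked in log space.
import math

# Hypothetical estimates for the LOCATION field.
P_PREFIX   = {"Place": 0.30, ":": 0.25, "Speaker": 0.01}   # Pr(word in prefix | LOCATION)
P_CONTENTS = {"Wean": 0.10, "Hall": 0.12, "Rm": 0.08, "5409": 0.02}
P_SUFFIX   = {"Speaker": 0.20, ":": 0.25}
P_LENGTH   = {3: 0.2, 4: 0.4, 5: 0.2}                      # Pr(window length | LOCATION)
UNSEEN = 1e-4                                               # crude smoothing for unseen words

def log_score(prefix, contents, suffix):
    """log Pr(window features | LOCATION); higher means a better LOCATION candidate."""
    s = math.log(P_LENGTH.get(len(contents), UNSEEN))
    for table, words in ((P_PREFIX, prefix), (P_CONTENTS, contents), (P_SUFFIX, suffix)):
        s += sum(math.log(table.get(w, UNSEEN)) for w in words)
    return s

tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian".split()
# Candidate window "Wean Hall Rm 5409" with 2 prefix and 2 suffix tokens:
print(log_score(prefix=tokens[3:5], contents=tokens[5:9], suffix=tokens[9:11]))
# Compare against other candidate windows / a threshold to decide whether to extract.
```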
"Naive Bayes" Sliding Window Results

Domain: CMU UseNet seminar announcements (such as the one above)

Field        F1
Person Name  30%
Location     61%
Start Time   98%

Slides from Cohen & McCallum
Realistic Sliding-Window-Classifier IE

- What windows to consider? All windows containing as many tokens as the shortest example, but no more tokens than the longest example.
- How to represent a classifier? It might restrict the length of the window; restrict the vocabulary or formatting used before/after/inside the window; restrict the relative order of tokens; etc.
- Learning methods: SRV, top-down rule learning [Freitag AAAI '98]; Rapier, bottom-up [Califf & Mooney, AAAI '99]

Slides from Cohen & McCallum
Rapier: results (precision/recall)

Rapier: results vs. SRV

Slides from Cohen & McCallum
Rule-learning Approaches to Sliding-Window Classification: Summary

SRV, Rapier, and WHISK [Soderland KDD '97]. Their classifier representations allow restriction of the relationships between tokens, etc., and are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog). Use of these "heavyweight" representations is complicated, but seems to pay off in results. Can simpler representations for classifiers work?

Slides from Cohen & McCallum
BWI: Learning to Detect Boundaries

Another formulation: learn three probabilistic classifiers:
- START(i) = Prob(position i starts a field)
- END(j) = Prob(position j ends a field)
- LEN(k) = Prob(an extracted field has length k)

Then score a possible extraction (i, j) by START(i) * END(j) * LEN(j - i). LEN(k) is estimated from a histogram. [Freitag & Kushmerick, AAAI 2000]

Slides from Cohen & McCallum
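A minimal sketch of the boundary-model scoring: choose the span (i, j) maximizing START(i) * END(j) * LEN(j - i). The probability tables below are invented toy values; BWI learns its START/END detectors by boosting and builds LEN from a histogram.

```python
# Boundary-model extraction: score every candidate span and keep the best one.
from itertools import product

START = {3: 0.7, 5: 0.2}            # Prob(position i starts the field)
END   = {6: 0.6, 8: 0.3}            # Prob(position j ends the field)
LEN   = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.15, 5: 0.05}  # Prob(field length = k)

def best_span(max_pos: int):
    """Return the highest-scoring (i, j, score) span, or None if none scores > 0."""
    best = None
    for i, j in product(range(max_pos), range(max_pos)):
        if j <= i:
            continue
        score = START.get(i, 0.0) * END.get(j, 0.0) * LEN.get(j - i, 0.0)
        if score > 0 and (best is None or score > best[2]):
            best = (i, j, score)
    return best

print(best_span(10))  # -> (3, 6, 0.168), i.e. 0.7 * 0.6 * 0.4
```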
BWI: Learning to Detect Boundaries

BWI uses boosting to find "detectors" for START and END. Each weak detector has a BEFORE and an AFTER pattern (on the tokens before/after position i). Each "pattern" is a sequence of tokens and/or wildcards like anyAlphabeticToken, anyNumber, etc. The weak learner for patterns uses greedy search (plus lookahead) to repeatedly extend a pair of empty BEFORE/AFTER patterns.

Slides from Cohen & McCallum
BWI: Learning to Detect Boundaries

Field        F1
Person Name  30%
Location     61%
Start Time   98%

Slides from Cohen & McCallum
Problems with Sliding Windows and Boundary Finders

Decisions in neighboring parts of the input are made independently of each other:
- A Naive Bayes sliding window may predict a "seminar end time" before the "seminar start time".
- It is possible for two overlapping windows to both be above threshold.
- In a boundary-finding system, left boundaries are laid down independently of right boundaries, and their pairing happens as a separate step.

Solution? Joint inference...

Slides from Cohen & McCallum
Extraction: from Named Entities to Binary Relations

How would one extend a sliding window approach to binary relations?
Snowball
Pattern Representation

<Pat_left, Tag_1, Pat_mid, Tag_2, Pat_right>, where each Tag_i is a named-entity tag and each Pat_ is a vector in term space; candidate tuples are scored by degree of match.

Discussion: brittle candidate generation? (Can't extract if the location is mentioned before the organization.) Dependence on the Alembic tagger.
Generating & Evaluating Patterns

- Generation of candidate patterns
- Evaluation of candidate patterns: selectivity vs. coverage vs. confidence (precision), e.g. Riloff's Conf * log |Positive| metric (a pattern correct 2/2 times vs. one correct 4/12 times)
Evaluating Tuples

Conf(T) = 1 - PROD_{i=0..|P|} (1 - Conf(P_i) * Match(T, P_i))
Conf(P) = Conf_new(P) * W + Conf_old(P) * (1 - W)

Comments? Simulated annealing? Discard poor tuples (vs. not counting them as seeds)? Lower the confidence of old tuples?
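A minimal sketch of tuple scoring with the formula above; the pattern confidences and match scores are made-up illustrative values, not Snowball's actual estimates.

```python
# Snowball-style tuple scoring: a tuple is believable unless every pattern
# that matched it is unreliable.
def tuple_confidence(pattern_matches):
    """pattern_matches: list of (Conf(P_i), Match(T, P_i)) pairs for tuple T."""
    conf = 1.0
    for p_conf, match in pattern_matches:
        conf *= (1.0 - p_conf * match)
    return 1.0 - conf

def pattern_confidence(conf_new, conf_old, w=0.5):
    """Blend the newly estimated pattern confidence with its previous value."""
    return conf_new * w + conf_old * (1 - w)

# Tuple matched by two patterns: one strong match, one weak.
print(tuple_confidence([(0.9, 0.8), (0.4, 0.3)]))  # 1 - (1-0.72)(1-0.12) = 0.7536
```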
Overall Algorithm

Discussion: relation to EM? Relation to KnowItAll? Will it work for the long tail? Tagging vs. full NLP; synonyms; negative examples; general relations vs. functions (keys).
Evaluation

Effect of seed quality; effect of seed quantity; other domains (shouldn't this experiment be easy?); ease of use: training examples vs. parameter tweaking.
Contributions

- Techniques for pattern generation
- Strategies for evaluating patterns & tuples
- Evaluation methodology & metrics
References

[Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R.: Nymble: a high-performance learning name-finder. In Proceedings of ANLP'97, p. 194-201.
[Califf & Mooney 1999] Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
[Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of the Eleventh International World Wide Web Conference (WWW-2002).
[Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
[Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. In Proceedings of ACM SIGMOD-98.
[Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language. ACM Transactions on Information Systems, 18(3).
[Cohen, 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web. Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).
[Collins & Singer 1999] Collins, M.; and Singer, Y.: Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[De Jong 1982] De Jong, G.: An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.
[Freitag 98] Freitag, D.: Information extraction from HTML: application of a general machine learning approach. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
[Freitag, 1999] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
[Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Machine Learning 39(2/3): 99-101 (2000).
[Freitag & Kushmerick, 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).
[Freitag & McCallum 1999] Freitag, D. and McCallum, A.: Information extraction using HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
[Kushmerick, 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness. Artificial Intelligence, 118 (pp. 15-68).
[Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML-2001.
[Leek 1997] Leek, T. R.: Information extraction using hidden Markov models. Master's thesis, UC San Diego.
[McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira, F.: Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML-2000.
[Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R.: A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226-233.

Slides from Cohen & McCallum
Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.[Freitag 2000], Freitag, D: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).Freitag & Kushmerick, 1999] Freitag, D; Kushmerick, D.: Boosted Wrapper Induction. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99)[Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinakge. In Proceedings AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.[Kushmerick, 2000] Kushmerick, N: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118(pp 15-68).[Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.[Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master’s thesis. UC San Diego.[McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira. F., Maximum entropy Markov models for information extraction and segmentation, In Proceedings of ICML-2000[Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226 - 233.Slides from Cohen & McCallum