1 2 Can we automatically - PowerPoint Presentation

Uploaded by phoebe-click on 2016-03-18

Presentation Transcript


Can we automatically extract this information from the text (instead of depending on creators to provide automated annotations)?

Information Extraction

What is “Information Extraction”

Filling slots in a database from sub-segments of text.

As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

NAME TITLE ORGANIZATION

Slides from Cohen & McCallum
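The slot-filling task above can be sketched with a hand-written pattern. The regex below is purely illustrative (real IE systems learn such extractors rather than hard-coding them); it targets the "said NAME, a ORG TITLE" construction from the passage:

```python
import re

# Hand-written pattern for "said NAME, a ORGANIZATION TITLE" mentions,
# e.g. '... said Bill Veghte, a Microsoft VP.' (illustrative only).
PATTERN = re.compile(r'said\s+([A-Z]\w+ [A-Z]\w+), a (\w+) (VP|CEO|founder)')

def fill_slots(text):
    """Return (name, organization, title) tuples found in the text."""
    return [(m.group(1), m.group(2), m.group(3)) for m in PATTERN.finditer(text)]

text = ('"We can be open source. We love the concept of shared source," '
        'said Bill Veghte, a Microsoft VP.')
print(fill_slots(text))  # [('Bill Veghte', 'Microsoft', 'VP')]
```

A single surface pattern like this is brittle, which is exactly why the lecture moves on to learned extractors.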

What is "Information Extraction"

Filling slots in a database from sub-segments of text.

As a task: (the same passage, now with the extracted slot fillers highlighted, e.g. "...said Bill Veghte, a Microsoft VP...")

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

Slides from Cohen & McCallum

Tapping into the Collective Unconscious

Another thread of exciting research is driven by the realization that the Web is not random at all! It is written by humans, so analyzing its structure and content allows us to tap into the collective unconscious. Meaning can emerge from syntactic notions such as "co-occurrences" and "connectedness".

Examples:

Analyzing term co-occurrences in web-scale corpora to capture semantic information (today's paper)
Analyzing the link structure of the web graph to discover communities (DoD and NSA are very much into this as a way of breaking terrorist cells)
Analyzing the transaction patterns of customers (collaborative filtering)

Big Idea 3: How can we possibly do this without full NLP?

"(Un)wrapping the wrapped results.."

Fielded IE Systems: Citeseer, Google Scholar, Libra

How do they do it? Why do they fail?

4/30


IE in Context

Create ontology

Segment

Classify

Associate

Cluster

Load DB

Spider

Query,

Search

Data mine

(Pipeline diagram: Document collection, filtered by relevance, feeds IE; labeled training data trains the extraction models; IE output is loaded into a Database.)

Slides from Cohen & McCallum

What is “Information Extraction”

Information Extraction = segmentation + classification + clustering + association

As a family of techniques: (the same Microsoft/open-source passage, shown with segments such as "Microsoft Corporation", "CEO", "Bill Gates", "Bill Veghte", "VP", "Richard Stallman", "founder", "Free Software Foundation" marked up)

Slides from Cohen & McCallum


NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

Slides from Cohen & McCallum

IE History

Pre-Web: mostly news articles

De Jong's FRUMP [1982]: hand-built system to fill Schank-style "scripts" from news wire
Message Understanding Conference (MUC): DARPA ['87-'95], TIPSTER ['92-'96]
Most early work dominated by hand-built models, e.g. SRI's FASTUS, hand-built FSMs
But by the 1990s, some machine learning: Lehnert, Cardie, Grishman; and then HMMs: Elkan [Leek '97], BBN [Bikel et al. '98]

Web:

AAAI '94 Spring Symposium on "Software Agents": much discussion of ML applied to the Web (Maes, Mitchell, Etzioni)
Tom Mitchell's WebKB, '96: build KBs from the Web
Wrapper induction: first by hand, then ML: [Doorenbos '96], [Soderland '96], [Kushmerick '97], ...

Slides from Cohen & McCallum

16

Information Extraction vs. NLP?

Information extraction attempts to find some of the structure and meaning in (hopefully template-driven) web pages. As IE becomes more ambitious and text becomes more free-form, IE ultimately becomes equal to NLP.

The Web does give one particular boost to NLP: massive corpora.

17

MUC

DARPA funded significant efforts in IE in the early to mid 1990s. The Message Understanding Conference (MUC) was an annual event/competition where results were presented. It focused on extracting information from news articles:

Terrorist events
Industrial joint ventures
Company management changes

Information extraction is of particular interest to the intelligence community (CIA, NSA).

What makes IE from the Web Different?

Less grammar, but more formatting & linking. The directory structure, link structure, formatting, and layout of the Web is its own new grammar.

Newswire example: "Apple to Open Its First Retail Store in New York City. MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience. 'Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week,' said Steve Jobs, Apple's CEO. 'We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles.'"

Web example: www.apple.com/retail, www.apple.com/retail/soho, www.apple.com/retail/soho/theatre.html

Slides from Cohen & McCallum

Landscape of IE Tasks (1/4):

Pattern Feature Domain

A spectrum of pattern feature domains:

Text paragraphs without formatting
Grammatical sentences and some formatting & links
Non-grammatical snippets, rich formatting & links
Tables

Example (free text): "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."

Slides from Cohen & McCallum

Landscape of IE Tasks (2/4):

Pattern Scope

Web-site specific (formatting; e.g. Amazon.com book pages)
Genre specific (layout; e.g. resumes)
Wide, non-specific (language; e.g. university names)

Slides from Cohen & McCallum

Landscape of IE Tasks (3/4):

Pattern Complexity

Closed set (e.g. U.S. states): "He was born in Alabama..." "The big Wyoming sky..."

Regular set (e.g. U.S. phone numbers): "Phone: (413) 545-1323" "The CALD main office can be reached at 412-268-1299"

Complex pattern (e.g. U.S. postal addresses): "University of Arkansas, P.O. Box 140, Hope, AR 71802" "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"

Ambiguous patterns, needing context and many sources of evidence, e.g. word patterns for person names: "...was among the six houses sold by Hope Feldman that year." "Pawel Opalinski, Software Engineer at WhizBang Labs."

Slides from Cohen & McCallum

Landscape of IE Tasks (4/4):

Pattern Combinations

Single entity ("named entity" extraction):
Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut

Binary relationship:
Relation: Person-Title; Person: Jack Welch; Title: CEO
Relation: Company-Location; Company: General Electric; Location: Connecticut

N-ary record:
Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

Example text: "Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt."

Slides from Cohen & McCallum

Evaluation of Single Entity Extraction

Example sentence: "Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke."

TRUTH: four true segments (Michael Kearns, Sebastian Seung, Richard M. Karpe, Martin Cooke). PRED: six predicted segments, of which two are correct.

Precision = # correctly predicted segments / # predicted segments = 2/6
Recall = # correctly predicted segments / # true segments = 2/4
F1 = harmonic mean of precision and recall = 1 / (((1/P) + (1/R)) / 2)

Slides from Cohen & McCallum

State of the Art Performance

Named entity recognition

Person, Location, Organization, ...: F1 in the high 80s or low to mid 90s.

Binary relation extraction:
Contained-in (Location1, Location2), Member-of (Person1, Organization1): F1 in the 60s, 70s, or 80s.

Wrapper induction:
Extremely accurate performance obtainable, but human effort (~30 min) required on each site.

Slides from Cohen & McCallum

Landscape of IE Techniques (1/1):

Models

Any of these models can be used to capture words, formatting or both.

Lexicons: test whether a segment (e.g. "Kentucky" in "Abraham Lincoln was born in Kentucky.") is a member of a list (Alabama, Alaska, ..., Wisconsin, Wyoming).

Classify pre-segmented candidates: a classifier assigns a class to each candidate segment ("Abraham Lincoln", "Kentucky", ...).

Sliding window: a classifier is run over every window of the text, trying alternate window sizes.

Boundary models: classifiers detect BEGIN and END boundaries of fields.

Finite state machines: find the most likely state sequence for the token sequence.

Context-free grammars: find the most likely parse.

Slides from Cohen & McCallum

Three Examples

(Un)wrappers, which use path expressions on DOM trees

Pattern extractors, which use path expressions on parse trees

Context-based slot fillers, which annotate words into an ontology with the help of the context surrounding them

27

More Ambitious (Blue Sky) Approaches

The Semantic Web needs tagged data and background knowledge; (blue-sky approaches try to) automate both.

Knowledge extraction: extract base-level knowledge ("facts") directly from the web.
Automated tagging: start with a background ontology and tag other web pages (SemTag/Seeker).

The information extraction tasks in fielded applications like Citeseer/Libra are narrowly focused: we assume we are learning specific relations (e.g. author/title) and that the extracted relations will be put in a database for DB-style lookup. Let's look at the state of the feasible art before going to blue sky.

28

Extraction from Templated Text

Many web pages are generated automatically from an underlying database, so the HTML structure of the pages is fairly specific and regular (semi-structured). However, the output is intended for human consumption, not machine interpretation.

An IE system for such generated pages allows the web site to be viewed as a structured database. An extractor for a semi-structured web site is sometimes referred to as a wrapper, and the process of extracting from such pages as screen scraping.

29

Templated Extraction using DOM Trees

Web extraction may be aided by first parsing web pages into DOM trees. Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract. Regex patterns may still be needed to identify the proper portion of the final CharacterData node.

30

Sample DOM Tree Extraction

(DOM tree diagram: HTML → BODY, with BODY → B → "Age of Spiritual Machines" and BODY → FONT → A → "Ray Kurzweil".)

Title path: HTML / BODY / B / CharacterData
Author path: HTML / BODY / FONT / A / CharacterData

This can be "semi-automated": users show examples and the program remembers the path expressions. Wrapper maintenance? Cheap labor...

Basis for many startups like Junglee, FlipDog, etc. If there is cooperation from the source, an API can be established, removing the need for wrappers.

Three Examples

(Un)wrappers, which use path expressions on DOM trees

Pattern extractors, which use path expressions on parse trees

Context-based slot fillers, which annotate words into an ontology with the help of the context surrounding them

35

Extraction from Free Text Involves Natural Language Processing

If extracting from automatically generated web pages, simple regex patterns usually work. If extracting from more natural, unstructured, human-written text, some NLP may help:

Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, etc.
Syntactic parsing: identify phrases (NP, VP, PP).
Semantic word categories (e.g. from WordNet): KILL = kill, murder, assassinate, strangle, suffocate.

Off-the-shelf software (e.g. the Brill tagger) is available to do this. Extraction patterns can then use POS or phrase tags, in analogy to regex patterns on DOM trees for structured text.

36

I. Generate-and-Test Architecture

Generic extraction patterns (Hearst '92): "...cities such as Boston, Los Angeles, and Seattle..." The pattern ("C such as NP1, NP2, and NP3") yields IS-A(head(NPi), C) for each NPi.

The pattern can misfire: "Detailed information for several countries such as maps, ..." (a ProperNoun(head(NP)) test helps here), or "I listen to pretty much all music but prefer country such as Garth Brooks."

This is template-driven extraction, where the template is stated in terms of the syntax tree.

37

Assessing Fact Accuracy

Assess candidate extractions using pointwise mutual information over search-engine hit counts (PMI-IR, Turney '01).

PMI(Seattle, City) = 24.7M / 107M ≈ 23%
PMI(Seattle, Tomato) = 1.5M / 107M ≈ 1%

Seattle is 20 times more likely to be a city than a tomato! (Recall, though, false web "facts" like "water flows upwards".)

38

..but many things indicate "city"-ness

PMI = frequency of co-occurrence of instance I and discriminator D.

Use 5-50 discriminator phrases Di; each PMI for a Di is a feature fi, e.g. "x is a city", "x has a population of", "x is the capital of y", "x's baseball team...".

Combine the evidence with naive Bayes: PMI is used for feature selection, the naive Bayes classifier for learning, and hit counts for assessing both PMI and the conditional probabilities. Keep the probabilities with the extracted facts.
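The naive Bayes evidence combination can be sketched as follows. The discriminator phrases and every probability below are invented for illustration; KnowItAll estimates these quantities from search-engine hit counts:

```python
import math

# (P(f | city), P(f | not city)) per discriminator -- made-up numbers.
FEATURES = {
    "x is a city": (0.9, 0.05),
    "x has a population of": (0.8, 0.10),
}

def nb_log_odds(prior_city, observed):
    """Log-odds that x is a city, given which discriminators fired."""
    log_odds = math.log(prior_city / (1 - prior_city))
    for f, (p_pos, p_neg) in FEATURES.items():
        if observed[f]:
            log_odds += math.log(p_pos / p_neg)
        else:
            log_odds += math.log((1 - p_pos) / (1 - p_neg))
    return log_odds

print(nb_log_odds(0.5, {"x is a city": True, "x has a population of": True}) > 0)   # True
print(nb_log_odds(0.5, {"x is a city": False, "x has a population of": False}) > 0) # False
```

Working in log-odds keeps the product of many small probabilities numerically stable, which matters once 5-50 discriminators are combined.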

39

Assessment In Action

I = "Yakima" (1,340,000 hits); D = <class name>; I+D = "Yakima city" (2,760 hits); PMI = 2760 / 1.34M ≈ 0.002.

I = "Avocado" (1,000,000 hits); I+D = "Avocado city" (10 hits); PMI = 0.00001 << 0.002.

40

Some Sources of Ambiguity

Time: "Clinton is the president" (in 1996).
Context: "common misconceptions..."
Opinion: Elvis...
Multiple word senses: Amazon, Chicago, Chevy Chase, etc.

Dominant senses can mask recessive ones! Approach: unmasking, e.g. query 'Chicago -City'.

41

Chicago Unmasked

(Charts: hit distributions for "Chicago" in its City sense vs. its Movie sense, before and after unmasking.)

43

Impact of Unmasking on PMI

Name         Recessive  Original  Unmask  Boost
Washington   city       0.50      0.99    96%
Casablanca   city       0.41      0.93    127%
Chevy Chase  actor      0.09      0.58    512%
Chicago      movie      0.02      0.21    972%

44

CBioC: Collaborative Bio-Curation

Motivation: to extract information nuggets from articles and abstracts and store them in a database. The challenge is that the number of articles is huge and keeps growing, and natural language must be processed.

The two existing approaches, human curation and automatic information extraction systems, cannot meet the challenge: the first is expensive, while the second is error-prone.

45

CBioC (cont’d)

Approach: we propose a solution that is inexpensive and that scales up. It takes automatic information extraction methods as a starting point, based on the premise that if there are a lot of articles, then there must be a lot of readers and authors of those articles. We provide a mechanism by which the readers of the articles can participate and collaborate in the curation of information. We refer to this as "collaborative curation".

46

Using the C-BioCurator System (cont'd)

What is the main difference between KnowItAll and CBioC?

Assessment: KnowItAll does it by hit counts (PMI); CBioC by voting.

Three Examples

(Un)wrappers, which use path expressions on DOM trees

Pattern extractors, which use path expressions on parse trees

Context-based slot fillers, which annotate words into an ontology with the help of the context surrounding them

49

Annotate base facts, given text and ontology

50

Annotation

"The Chicago Bulls announced yesterday that Michael Jordan will..."

becomes:

The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan,_Michael">Michael Jordan</resource> will...
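The annotation step above can be sketched as a dictionary lookup over known mentions (the URIs are copied from the slide; a real annotator would disambiguate mentions in context rather than blindly replace strings):

```python
# Ontology mapping from entity mention to TAP node (from the slide).
ONTOLOGY = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan,_Michael",
}

def annotate(text):
    """Wrap each known mention in a <resource> tag pointing at its node."""
    for mention, uri in ONTOLOGY.items():
        text = text.replace(mention, f'<resource ref="{uri}">{mention}</resource>')
    return text

print(annotate("The Chicago Bulls announced yesterday that Michael Jordan will..."))
```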

51

Semantic Annotation

Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt

The simplest task of metadata extraction on natural language is to establish a "type" relation between entities in the NL resources and concepts in ontologies: named entity identification.

52

Semantic Annotation

The content of the annotation consists of rich semantic information, targeted not only at human readers of the resources but also at software agents.

formal: metadata following structural standards vs. informal: personal notes written in the margin while reading an article
explicit: carries sufficient information for interpretation vs. tacit: many personal annotations (telegraphic and incomplete)

http://www-scf.usc.edu/~csci586/slides/6

53

Uses of Annotation

http://www-scf.usc.edu/~csci586/slides/8

54

Objectives of Annotation

Generate metadata for existing information, e.g. author tags in HTML, RDF descriptions added to HTML, content descriptions for multimedia files.

Employ metadata for improved search, navigation, presentation, and summarization of contents.

http://www.aifb.uni-karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf

55

Annotation

Current practice of annotation for knowledge identification and extraction is time consuming, needs annotation by experts, and is complex. Goal: reduce the burden of text annotation for knowledge management.

www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt

SemTag & Seeker

WWW-03 Best Paper Prize

Seeded with the TAP ontology (72k concepts) and ~700 human judgments. Crawled 264 million web pages and extracted 434 million semantic tags, automatically disambiguated.

57

SemTag

A research project at IBM; very large scale (the largest to date: 264 million web pages). Goal: to provide an early set of widespread semantic tags through automated generation.

58

SemTag

Uses a broad, shallow knowledge base: TAP, with lexical and taxonomic information about popular objects (music, movies, sports, etc.).

59

SemTag

Problem: no write access to the original document, so how do you annotate? Solution: store annotations in a web-available database.

60

SemTag

Semantic Label Bureau: a separate store of semantic annotation information, an HTTP server that can be queried for annotation information. Examples: find all semantic tags for a given document, or all semantic tags for a particular object.

61

SemTag

Methodology

62

SemTag

Three phases:

Spotting pass: tokenize the document; record all instances plus a 20-word window.
Learning pass: find the corpus-wide distribution of terms at each internal node of the taxonomy, based on a representative sample.
Tagging pass: scan the windows to disambiguate each reference, finally determined to be a TAP object.

63

SemTag

Another problem, magnified by the scale: ambiguity resolution. Two fundamental categories of ambiguities: some labels appear at multiple locations in the taxonomy, and some entities have labels that occur in contexts with no representative in the taxonomy.

64

SemTag

Solution: Taxonomy Based Disambiguation (TBD). The TBD expectation: human-tuned parameters are used in small, critical sections, while automated approaches deal with the bulk of the information.

65

SemTag

TBD methodology:

Each node in the taxonomy is associated with a set of labels (Cats, Football, and Cars all contain "jaguar").
Each label in the text is stored with a window of 20 words: the context.
Each node has an associated similarity function mapping a context to a similarity; higher similarity means the context is more likely to contain a reference to the node.

66

SemTag

Similarity: built a 200,000-word lexicon (the 200,100 most common words minus the 100 most common), giving a 200,000-dimensional vector space. Training data: spots (label, context) paired with the correct node. The distribution of terms for each node is estimated, and standard cosine similarity is computed between TFIDF vectors (context vs. node).

67

SemTag

References inside the taxonomy vs. references outside the taxonomy: a label may match multiple nodes, including nodes other than the intended one, so the question is whether a context c is appropriate for a node v at all.

68

SemTag

Some internal nodes are very popular, so SemTag associates with each node a measurement of how accurate Sim is likely to be there, and of how ambiguous the node is overall (consistency of human judgment). The TBD algorithm returns 1 or 0 to indicate whether a particular context c is on topic for a node v, achieving 82% accuracy on 434 million spots.

69

SemTagSlide70

70

Summary

Information extraction can be motivated either as explicating more structure from the data or as an automated path to the Semantic Web.

Extraction complexity depends on whether the text is "templated" or free-form: extraction from templated text can be done by regular expressions, while extraction from free-form text requires NLP (e.g. in terms of part-of-speech tagging).

"Annotation" involves connecting terms in free-form text to items in the background knowledge; it too can be automated.

Sliding Windows

Slides from Cohen & McCallum

Landscape: Focus of this Tutorial

Pattern complexity: closed set, regular, complex, ambiguous
Pattern feature domain: words, words + formatting, formatting
Pattern scope: site-specific, genre-specific, general
Pattern combinations: entity, binary, n-ary
Models: lexicon, regex, window, boundary, FSM, CFG

Slides from Cohen & McCallum

Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

Jaime Carbonell, School of Computer Science, Carnegie Mellon University, 3:30 pm, 7500 Wean Hall.

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

Slides from Cohen & McCallum


A “Naïve Bayes” Sliding Window Model

[Freitag 1997]

… 00 : pm  Place : Wean Hall Rm 5409  Speaker : Sebastian Thrun …

    w_{t-m} … w_{t-1}   [ w_t … w_{t+n} ]   w_{t+n+1} … w_{t+n+m}
       prefix                 contents              suffix

Other examples of sliding window: [Baluja et al 2000] (decision tree over individual words & their context)

If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.

Estimate Pr(LOCATION | window) using Bayes rule
Try all "reasonable" windows (vary length, position)
Assume independence for length, prefix words, suffix words, and content words
Estimate from data quantities like: Pr("Place" in prefix | LOCATION)

Slides from Cohen & McCallumSlide78
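The estimation scheme above can be sketched in a few lines. Every probability value in the tables below is invented purely for illustration; in practice they are estimated from labeled training announcements, as the slide describes.

```python
import math

# Toy probability tables (illustrative values only; in a real system these
# are estimated from labeled data, e.g. Pr("Place" in prefix | LOCATION)).
P_PREFIX = {"Place": 0.4, ":": 0.3}
P_CONTENT = {"Wean": 0.2, "Hall": 0.25, "Rm": 0.15, "5409": 0.05}
P_SUFFIX = {"Speaker": 0.3, ":": 0.3}
P_LEN = {3: 0.2, 4: 0.5, 5: 0.2}   # histogram over field lengths
SMOOTH = 1e-4                       # fallback probability for unseen words

def window_log_score(tokens, start, end, k=2):
    """Log-score of tokens[start:end] being a LOCATION, with k prefix/suffix
    words, assuming independence of length, prefix, content, and suffix."""
    score = math.log(P_LEN.get(end - start, SMOOTH))
    for w in tokens[max(0, start - k):start]:       # prefix words
        score += math.log(P_PREFIX.get(w, SMOOTH))
    for w in tokens[start:end]:                     # content words
        score += math.log(P_CONTENT.get(w, SMOOTH))
    for w in tokens[end:end + k]:                   # suffix words
        score += math.log(P_SUFFIX.get(w, SMOOTH))
    return score

tokens = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()

# Try all windows of length 3..5 and keep the highest-scoring one.
i, j, s = max(((i, j, window_log_score(tokens, i, j))
               for i in range(len(tokens))
               for j in range(i + 3, min(i + 6, len(tokens) + 1))),
              key=lambda t: t[2])
```

With these tables the best window is the span "Wean Hall Rm 5409", since any window including an out-of-table word pays the heavy smoothing penalty.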

“Naïve Bayes” Sliding Window Results

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

Domain: CMU UseNet Seminar Announcements

Field F1

Person Name: 30%

Location: 61%

Start Time: 98%

Slides from Cohen & McCallumSlide79

Realistic sliding-window-classifier IE

What windows to consider?

all windows containing at least as many tokens as the shortest training example, but no more tokens than the longest example

How to represent a classifier? It might:
Restrict the length of the window;
Restrict the vocabulary or formatting used before/after/inside the window;
Restrict the relative order of tokens; etc.

Learning methods:
SRV: top-down rule learning [Freitag AAAI '98]
Rapier: bottom-up rule learning [Califf & Mooney, AAAI '99]

Slides from Cohen & McCallumSlide80
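The window-enumeration restriction above (no shorter than the shortest training field, no longer than the longest) is a small generator; the length bounds and example tokens here are illustrative.

```python
def candidate_windows(tokens, min_len, max_len):
    """Yield (start, end) spans whose length lies between the shortest and
    longest field seen in training -- the pruning described above."""
    for start in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end <= len(tokens):
                yield start, end

tokens = "3:30 pm 7500 Wean Hall".split()
# Suppose training fields ranged from 2 to 3 tokens long:
spans = list(candidate_windows(tokens, 2, 3))
```

For a 5-token document this yields only 7 candidate spans instead of all 15 possible sub-sequences.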

Rapier: results – precision/recall

Slides from Cohen & McCallumSlide81

Rapier – results

vs. SRV

Slides from Cohen & McCallumSlide82

Rule-learning approaches to sliding-window classification: Summary

SRV, Rapier, and WHISK [Soderland KDD '97]

Representations for classifiers allow restriction of the relationships between tokens, etc.
Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
Use of these "heavyweight" representations is complicated, but seems to pay off in results
Can simpler representations for classifiers work?

Slides from Cohen & McCallumSlide83

BWI: Learning to detect boundaries

Another formulation: learn three probabilistic classifiers:
START(i) = Prob(position i starts a field)
END(j) = Prob(position j ends a field)
LEN(k) = Prob(an extracted field has length k)

Then score a possible extraction (i, j) by START(i) * END(j) * LEN(j - i)
LEN(k) is estimated from a histogram

[Freitag & Kushmerick, AAAI 2000]

Slides from Cohen & McCallumSlide84
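The scoring rule above can be written down directly. The probability tables here are invented for illustration; in BWI they would come from the learned START/END detectors and a length histogram estimated on training data.

```python
def extraction_score(start_prob, end_prob, len_hist, i, j):
    """Score span (i, j) as START(i) * END(j) * LEN(j - i), following the
    boundary-finding formulation of [Freitag & Kushmerick, AAAI 2000].
    Unseen positions/lengths get probability 0."""
    return (start_prob.get(i, 0.0)
            * end_prob.get(j, 0.0)
            * len_hist.get(j - i, 0.0))

# Toy tables for a short document (illustrative values only):
start_prob = {2: 0.8, 5: 0.3}   # positions likely to start a field
end_prob = {4: 0.7, 6: 0.6}     # positions likely to end a field
len_hist = {1: 0.2, 2: 0.5, 3: 0.2}   # length histogram from training

# Pick the highest-scoring (i, j) pair with j > i.
best = max(((i, j) for i in start_prob for j in end_prob if j > i),
           key=lambda p: extraction_score(start_prob, end_prob, len_hist, *p))
```

Note how the length histogram vetoes the pair (2, 6): its boundary probabilities are high, but a length-4 field was never seen in training.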

BWI: Learning to detect boundaries

BWI uses boosting to find "detectors" for START and END
Each weak detector has a BEFORE and an AFTER pattern (on the tokens before/after position i)
Each "pattern" is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyNumber, …
The weak learner for "patterns" uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns

Slides from Cohen & McCallumSlide85
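A single weak detector of this kind can be sketched as follows. The wildcard names `<alpha>` and `<num>` are stand-ins for BWI's anyAlphabeticToken and anyNumber, and the example detector is hypothetical, not one learned by BWI.

```python
def token_matches(pattern_tok, tok):
    """Match one pattern element against a token; '<alpha>' and '<num>'
    are wildcard stand-ins for anyAlphabeticToken / anyNumber."""
    if pattern_tok == "<alpha>":
        return tok.isalpha()
    if pattern_tok == "<num>":
        return tok.isdigit()
    return pattern_tok == tok

def detector_fires(before, after, tokens, i):
    """A weak START detector fires at position i if its BEFORE pattern
    matches the tokens ending at i and its AFTER pattern matches the
    tokens starting at i."""
    if i < len(before) or i + len(after) > len(tokens):
        return False
    return (all(token_matches(p, t)
                for p, t in zip(before, tokens[i - len(before):i]))
            and all(token_matches(p, t)
                    for p, t in zip(after, tokens[i:i + len(after)])))

tokens = "Place : Wean Hall Rm 5409".split()
# Hypothetical detector: a field starts right after "Place :" and begins
# with an alphabetic token.
fires = [i for i in range(len(tokens) + 1)
         if detector_fires(["Place", ":"], ["<alpha>"], tokens, i)]
```

Boosting would combine many such weak detectors, each weighted by its accuracy, into the final START(i) and END(j) scores.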

BWI: Learning to detect boundaries

Field F1

Person Name: 30%

Location: 61%

Start Time: 98%

Slides from Cohen & McCallumSlide86

Problems with Sliding Windows

and Boundary Finders

Decisions in neighboring parts of the input are made independently from each other.

The Naïve Bayes sliding window may predict a "seminar end time" before the "seminar start time".
It is possible for two overlapping windows to both be above threshold.
In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Solution? Joint inference…

Slides from Cohen & McCallumSlide87

Extraction:

Named Entity → Binary Relations

How do we extend a sliding window approach?Slide88

SnowballSlide89

Pattern Representation

Brittle candidate generation?

Can’t extract if location mentioned before organization?

<Pat_left, Tag_1, Pat_mid, Tag_2, Pat_rt>
Tag_ is a named-entity tag
Pat_ is a vector (in term space)
Degree of match
Dependence on the Alembic taggerSlide90

Generating & Evaluating Patterns

Generation of Candidate Patterns

Evaluation of Candidate Patterns

Selectivity vs. Coverage vs. Confidence (Precision)
Riloff's Conf * log |Positive|
Example: 2/2 vs. 4/12Slide91
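The "2/2 ~ 4/12" comparison above can be read through a Riloff-style RlogF score, which trades selectivity against coverage. This is a sketch of that metric, not Snowball's exact confidence formula; the counts are the ones from the slide.

```python
import math

def rlogf(positive, total):
    """Riloff-style pattern score: precision weighted by log coverage.

    positive = matches of the pattern that agree with known seed tuples,
    total = all tuples the pattern matched."""
    if positive == 0:
        return 0.0
    return (positive / total) * math.log2(positive)

# A very selective pattern (2 of 2 matches correct) vs. a broader,
# noisier one (4 of 12 matches correct):
selective = rlogf(2, 2)    # precision 1.0, coverage 2
broad = rlogf(4, 12)       # precision 1/3, coverage 4
```

Under this score the selective 2/2 pattern beats the 4/12 pattern: its extra coverage does not make up for its much lower precision.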

Evaluating Tuples

Conf(T) = 1 − ∏_{i=0}^{|P|} (1 − Conf(P_i) * Match(T, P_i))

Conf(P) = Conf_n(P) * W + Conf_o(P) * (1 − W)

Comments? Simulated annealing?
Discard poor tuples? (vs. not counting them as seeds)
Lower confidence of old tuples?Slide92
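The Conf(T) and Conf(P) formulas above take only a few lines to implement; the numeric inputs below are made up for illustration.

```python
def tuple_confidence(pattern_confs, matches):
    """Conf(T) = 1 - prod_i (1 - Conf(P_i) * Match(T, P_i)).

    A tuple is believed unless every pattern that matched it fails to
    support it; pattern_confs and matches are parallel lists over the
    patterns that matched the tuple."""
    doubt = 1.0
    for conf, match in zip(pattern_confs, matches):
        doubt *= (1.0 - conf * match)
    return 1.0 - doubt

def pattern_confidence(conf_new, conf_old, w=0.5):
    """Conf(P) = Conf_n(P) * W + Conf_o(P) * (1 - W): damping that blends
    a pattern's confidence from the current iteration with its old value."""
    return conf_new * w + conf_old * (1.0 - w)

# Two patterns match a tuple, with confidences 0.9 and 0.5 and
# degrees of match 1.0 and 0.8:
conf_t = tuple_confidence([0.9, 0.5], [1.0, 0.8])
```

Each additional supporting pattern can only raise Conf(T), which is why discarding (or down-weighting) poor patterns matters: weak evidence still accumulates.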

Overall Algorithm

Relation to EM?

Relation to KnowItAll

Will it work for the long tail?
Tagging vs. full NLP
Synonyms
Negative examples
General relations vs. functions (keys)Slide93

Evaluation

Effect of Seed Quality

Effect of Seed Quantity
Other Domains
Shouldn't this experiment be easy?

Ease of Use
Training examples vs. parameter tweakingSlide94

Contributions

Techniques for Pattern Generation

Strategies for Evaluating Patterns & Tuples
Evaluation Methodology & MetricsSlide95

References

[Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; Weischedel, R.: Nymble: a high-performance learning name-finder. In Proceedings of ANLP'97, pp. 194-201.

[Califf & Mooney 1999] Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).

[Cohen, Hurst & Jensen 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. In Proceedings of the Eleventh International World Wide Web Conference (WWW-2002).

[Cohen, Kautz & McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).

[Cohen 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. In Proceedings of ACM SIGMOD-98.

[Cohen 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language. ACM Transactions on Information Systems, 18(3).

[Cohen 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web. In Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).

[Collins & Singer 1999] Collins, M.; Singer, Y.: Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

[De Jong 1982] De Jong, G.: An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M.H. (eds.), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, pp. 149-176.

[Freitag 1998] Freitag, D.: Information extraction from HTML: application of a general machine learning approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).

[Freitag 1999] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.

[Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Machine Learning 39(2/3): 99-101 (2000).

[Freitag & Kushmerick 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).

[Freitag & McCallum 1999] Freitag, D.; McCallum, A.: Information extraction using HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.

[Kushmerick 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness. Artificial Intelligence, 118, pp. 15-68.

[Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML-2001.

[Leek 1997] Leek, T.R.: Information extraction using hidden Markov models. Master's thesis, UC San Diego.

[McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; Pereira, F.: Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML-2000.

[Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R.: A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), pp. 226-233.

Slides from Cohen & McCallum