

Presentation Transcript

Slide 1

A Trainable Document Summarizer
Julian Kupiec, Jan Pedersen & Francine Chen
ACM SIGIR '95

Presented by Mat Kelly
CS895 – Web-based Information Retrieval
Old Dominion University
November 22, 2011

The Automatic Creation of Literature Abstracts
H. P. Luhn
IBM Journal of R&D, 1958

Slide 2

Luhn's Objectives
- Exploration into automatic methods of obtaining abstracts
- Selects sentences that are most representative of the pertinent information
- Citations of the author's own statements constitute an "auto-abstract"

Slide 3

Which sentences are best?
- Establish a significance factor
- Frequency of word occurrence → word significance
- The relative position of significant words within the sentence is a measure for determining the significance of the sentence
- Why does this work? A writer repeats certain words as he elaborates

Slide 4

Over-Simplification
- The method does not differentiate words with the same stem
- Letter-by-letter analysis determines the probability that two words share a stem
- While authors will opt for synonymous word choices, they eventually run out and resort to repetition

polic = { policing, policy, police }
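
A minimal sketch of the kind of common-beginning consolidation described above; the prefix length and function name are illustrative assumptions, not Luhn's exact procedure:

    from collections import defaultdict

    def consolidate_by_prefix(words, prefix_len=5):
        """Group words that share a common beginning -- a crude stand-in
        for Luhn's letter-by-letter stem consolidation."""
        groups = defaultdict(set)
        for word in set(words):
            groups[word[:prefix_len].lower()].add(word)
        # Keep only prefixes that actually consolidate several words.
        return {p: ws for p, ws in groups.items() if len(ws) > 1}

    print(consolidate_by_prefix(["policing", "policy", "police", "search"]))
    # {'polic': {'policing', 'policy', 'police'}} (set order may vary)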

Slide 5

Premise
- No consideration is given to the meaning of words
- Instead, the closer certain words are associated, the more specifically an aspect of the subject is being treated
- Where the greatest number of frequently occurring different words are found close to each other, the probability is high that the information there is most representative of the article

Slide 6

- The criterion is the relationship of significant words to each other, rather than their distribution over the whole sentence
- Consider only the portions of sentences that are bracketed by significant words; disregard words beyond the limit from consideration of the current bracket
- A useful limit was found to be 4-5 non-significant words between significant words

Slide 7

Computing Significance Factor
- Determine the extent of the cluster by bracketing
- Count the number of significant words in the cluster
- Divide the square of that count by the total number of words in the cluster
- Tested on 50 articles of 300-4500 words each, compared against abstracts produced manually by 100 people

[Figure: words 1-7 of a sentence with four significant words marked; a bracket delimits the portion containing them. If significant words are not more than 4 apart, the whole sentence is cited.]
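
A minimal sketch of this scoring, assuming the set of significant words has already been chosen (the clustering loop and the example are illustrative; the gap limit of 4 follows the slide):

    def sentence_significance(sentence_words, significant, max_gap=4):
        """Luhn's significance factor: find clusters of significant words
        separated by at most max_gap non-significant words, and score each
        cluster as (significant words in cluster)^2 / (words in cluster)."""
        positions = [i for i, w in enumerate(sentence_words) if w in significant]
        if not positions:
            return 0.0
        best = 0.0
        start = prev = positions[0]
        count = 1
        for pos in positions[1:] + [None]:
            if pos is not None and pos - prev - 1 <= max_gap:
                prev, count = pos, count + 1
                continue
            best = max(best, count * count / (prev - start + 1))
            if pos is not None:
                start = prev = pos
                count = 1
        return best

    words = "the significant words cluster tightly around other significant words".split()
    print(sentence_significance(words, {"significant", "words", "cluster"}))  # 3.125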

Slide 8

- Resolving power depends on the total number of words in the article and decreases as that total increases
- Overcome by running the method on subdivisions of the article, with the highest-ranking sentences combined to form the abstract
- Divisions might already exist in the paper's organization; otherwise, divide arbitrarily and with overlap

Slide 9

Procedures
- Abstracts prepared by first punching words onto cards(!)
- Pronouns & prepositions deleted via a lookup routine
- The remaining words sorted alphabetically
- Words with common beginnings consolidated (a rudimentary form of stemming); this produced errors of up to 5% but did not affect results
- Words with low frequency removed; the remainder marked as significant
- Sentence significance then computed with the previous formula
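
A minimal sketch of that word-selection pipeline; the stopword list and frequency threshold here are illustrative assumptions:

    from collections import Counter

    # Illustrative stand-in for the deleted pronoun/preposition lookup.
    STOPWORDS = {"he", "she", "it", "they", "of", "in", "on", "to", "with", "the", "and"}

    def significant_word_stems(tokens, min_freq=2, prefix_len=5):
        """Build the significant-word list: delete pronouns/prepositions,
        consolidate words by common beginnings, drop low-frequency stems."""
        kept = [t.lower() for t in tokens if t.lower() not in STOPWORDS]
        counts = Counter(t[:prefix_len] for t in kept)
        return {stem for stem, n in counts.items() if n >= min_freq}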

Slide 10

Abstract Creation with Result
- Apply a cutoff value of sentence significance
- A fixed number of sentences is required irrespective of document length
- Sentences could be weighted by assigning a premium value to a predetermined set of words if the article is of special interest
- If no sentences meet the threshold, reject the article as too general for the purpose of auto-abstracting
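
A minimal sketch of this selection step, reusing the sentence_significance function sketched earlier; the cutoff and summary length are illustrative defaults:

    def auto_abstract(sentences, significant, cutoff=3.0, n_sentences=3):
        """Score every sentence; reject the article if none meets the
        cutoff, otherwise return the top sentences in document order."""
        scored = [(sentence_significance(s.split(), significant), i, s)
                  for i, s in enumerate(sentences)]
        if max(score for score, _, _ in scored) < cutoff:
            return None  # too general for auto-abstracting
        top = sorted(scored, reverse=True)[:n_sentences]
        return [s for _, _, s in sorted(top, key=lambda t: t[1])]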

Slide 11

Example: Generated Abstract
"Two major recent developments have called the attention of chemists, physiologists, physicists and other scientists to mental diseases: It has been found that extremely minute quantities of chemicals can induce hallucinations and bizarre psychic disturbances in normal people, and mood-altering drugs (tranquilizers, for instance) have made long-institutionalized people amenable to therapy. (4.0) This poses new possibilities for studying brain chemistry changes in health and sickness and their alleviation, the California researchers emphasized. (5.4) The new studies of brain chemistry have provided practical therapeutic results and tremendous encouragement to those who must care for mental patients. (5.4)"

Slide 12

Conclusions
- The method proved feasible
- Highly reliable, consistent, and stable, unlike manual creation
- Possibility that the author's style causes inferior sentences to be promoted
- The method helps realize savings in human effort

Significant Words → Significant Sentences → Inclusion in Abstract

Slide 13

Kupiec's Objective
- Motive: provide an intermediate point between the document title and full text (i.e., the abstract)
- Documents as short as 20% of the original can be as informative as the full text*
- Extracts can be non-unique
- A combination of numerous methods (including Luhn's) would have the best performance

* A.H. Morris, G.M. Kasper, and D.A. Adams. The effects and limitations of automated text condensing on reading comprehension performance. Information Systems Research, pages 17-35, March 1992.

Slide 14

A Statistical Classification Problem
- Have a training set of documents with manually extracted abstracts
- Develop a classification function that estimates the probability that a given sentence is included in the abstract
- From this, generate new abstracts by ranking sentences according to this probability and selecting a user-specified number of top-scoring sentences

[Diagram: for the given sentences, each of Feature 1 ... Feature n contributes to a sentence's SCORE; a Bayesian classifier determines the probability of abstract inclusion, and sentences above an inclusion threshold are selected]

Slide 15

- Evaluation criterion: classification success rate/precision
- Requires a corpus (expensive); one was acquired from the non-profit Engineering Information Co. and used as the basis for the experiments
- All previous methods assume that documents exist in isolation

Slide 16

Features (Experimentally Obtained)
- Sentence Length Cut-off: short sentences (fewer than 5 words) are not usually included in summaries
- Fixed-Phrase: sentences containing, or following, indicator phrases such as "Summary" or "Conclusions" are likely to be in summaries
- Paragraph: consider the first 10 paragraphs and the last 5 paragraphs
- Thematic Word: score sentences according to their inclusion of words within the theme
- Uppercase Word: e.g. proper names; scored similarly to thematic words, with first occurrences scoring double later occurrences

A sketch of these feature extractors follows.
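
A minimal sketch of the five features under simple assumptions; the thresholds and helper layout are illustrative, not the paper's exact definitions:

    def extract_features(sentence, paragraph_idx, n_paragraphs, thematic,
                         fixed_phrases=("summary", "conclusion")):
        """Return a discrete feature vector for one sentence (a sketch)."""
        lowered = sentence.lower()
        words = lowered.split()
        return {
            "long_enough": len(words) > 5,                   # sentence length cut-off
            "fixed_phrase": any(p in lowered for p in fixed_phrases),
            "lead_or_tail_paragraph": (paragraph_idx < 10 or
                                       paragraph_idx >= n_paragraphs - 5),
            "thematic": sum(w in thematic for w in words) >= 2,
            "uppercase": any(w[0].isupper() and len(w) > 1
                             for w in sentence.split()[1:]),  # proper-name proxy
        }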

Slide 17

Classifier
- For each sentence s, determine the probability that it will be included in summary S given its k features
- Since all the features are discrete, the equation can be put in terms of probabilities rather than likelihoods
- This results in a simple Bayesian classification function that assigns a score to s, used to select sentences for inclusion in the summary
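
The classification function itself, reconstructed from the Kupiec et al. paper (features are assumed statistically independent, and P(s ∈ S) is a constant reflecting the compression rate):

    P(s \in S \mid F_1, \ldots, F_k) \approx
        \frac{P(s \in S) \prod_{j=1}^{k} P(F_j \mid s \in S)}
             {\prod_{j=1}^{k} P(F_j)}

A minimal log-space scorer under that approximation; the probability-table layout is an illustrative assumption, with the values presumed estimated by counting over the training corpus:

    import math

    def bayes_score(features, p_f_given_s, p_f, prior):
        """Kupiec-style sentence score: log-posterior of inclusion under
        the feature-independence approximation. `features` maps feature
        name -> discrete value; p_f_given_s and p_f are nested dicts of
        probabilities estimated from training data."""
        score = math.log(prior)
        for name, value in features.items():
            score += math.log(p_f_given_s[name][value]) - math.log(p_f[name][value])
        return score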

Slide 18

About the Corpus
- Articles originally lacked abstracts; summaries were created manually after the fact
- 188 document/summary pairs from 21 publications in scientific/technical domains
- Average summary length is 3 sentences

Slide 19

Sentence Matching
Using the manually created abstracts, match each summary sentence to sentences in the original document:
- Direct match: verbatim, or with only minor modifications
- Direct join: 2 or more document sentences used to make one summary sentence
- Unmatchable: suspected fabrication, without using sentences in the document
- Incomplete: some overlap exists, but content is not preserved in the summary; the summary sentence includes content from the original but contains other information not covered by a direct join

[Diagram: abstract sentences linked back to document sentences as direct matches and direct joins]

Slide 20

Evaluation
- Insufficient data existed for a separate test corpus, so a cross-validation strategy was used for evaluation
- Documents from a journal were selected for testing one at a time; all other document/summary pairs were used for training (see the sketch below)
- Results were summed over journals
- Unmatchable/incomplete sentences were excluded from training and testing, leaving 498 unique sentences
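
A minimal sketch of that hold-one-out protocol; train_classifier and score_document are hypothetical stand-ins for the training and scoring steps, not the paper's code:

    def cross_validate(documents, train_classifier, score_document):
        """Hold out each document in turn, train on all remaining
        document/summary pairs, and collect the summed results."""
        results = []
        for i, held_out in enumerate(documents):
            model = train_classifier(documents[:i] + documents[i + 1:])
            results.append(score_document(model, held_out))
        return results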

Slide 21

Evaluating Performance
Fraction of manual summary sentences that can be reproduced, limited by text excerpting: (451 + 19) / 568 = 83%

Distribution of Correspondence in Training Corpus:

                                   # Sentences   Fraction of Corpus
  Direct Sentence Matches              451              79%
  Direct Joins                          19               3%
  Unmatchable Sentences                 50               9%
  Incomplete Single Sentences           21               4%
  Incomplete Joins                      21               4%
  Total Manual Summary Sentences       568

A produced sentence is correct if:
- it has a direct sentence match and is present in the manual summary, or
- it is in the manual summary as part of a direct join and all other components of the join have been produced

Slide 22

Results
- Of the 568 sentences: 195 direct matches plus 6 direct joins, giving 201 correctly identified summary sentences (35% replication)
- For comparison, manual summary generation has only 25% overlap between different people, and 55% for the same person over time
- 211/498 (42%) of sentences were correctly identified by the summarizer

Slide 23

Conclusions
- For summaries 25% of the size of the document, 84% of the selected sentences were also selected by professionals
- For smaller summaries, an improvement of 74% was observed vs. simply presenting the beginning of the document

Slide 24

Comparing the Processes

Luhn:
Given Sentences → Significant Words → Significant Sentences → Inclusion in Abstract

Kupiec:
Given Sentences → each feature contributes to a sentence's SCORE (sentence length cut-off at 5 words; boost after a fixed phrase; priority for the first & last paragraphs; boost for thematic words; boost for capitalized, non-unit words) → Bayesian classifier determines the probability of abstract inclusion → Inclusion Threshold

Slide 25

References
- H.P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Develop., 2:159-165, 1958.
- Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. In Proc. of the 18th Annual International ACM SIGIR Conference, pages 68-73, Seattle, WA, 1995.