Slide1
A Trainable Document Summarizer
Julian Kupiec, Jan Pedersen & Francine Chen
ACM SIGIR ‘95
Presented by Mat Kelly
CS895 – Web-based Information Retrieval
Old Dominion University
November 22, 2011
The Automatic Creation of Literature Abstracts
H. P. Luhn
IBM Journal of R&D, 1958
Slide2
Luhn’s Objectives
Exploration into automatic methods of obtaining abstracts
Selects sentences that are most representative of pertinent info
Citations of author’s own statements constitute “auto-abstract”
Slide3
Which sentences are best?
Establish a significance factor
Freq. of word occurrence is a measure of word significance
Relative position of signif. words within a sentence is a measure for determining signif. of sentence
Why does this work?
Writer repeats certain words as he elaborates
Slide4
Over-Simplification
Method does not differentiate words with same stem
Letter-by-letter analysis to determine prob. that two words share a stem
While an author will opt for synonymous word choice, s/he’ll eventually run out and resort to repetition.
polic = { policing, policy, police }
Slide5
Premise
No consideration given to meaning of words.
Instead, the closer certain words are associated, the more specifically an aspect of the subject is being treated
Where the greatest number of freq. occurring different words are found close to each other, the prob. is high that information is most representative of the article.
Slide6
Criterion is relationship of signif. words to each other rather than distrib. over whole sentence.
Consider only portions of sentences that are bracketed by signif. words; disregard words beyond the limit when considering the current bracket.
Useful limit found is 4-5 non-signif. words between signif. words
Slide7
Computing Significance Factor
Determine extent of cluster by bracketing
Count # signif. words in cluster
Divide square of # by total # words in cluster
Tested on 50 articles of 300-4500 words each, compared against 100-person manual generation
[Figure: Significant Words marked * * * * within a bracketed span of word positions 1 2 3 4 5 6 7]
A portion of a sentence is bracketed
If signif. words are not more than 4 apart, whole sentence is cited
Slide8
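The significance factor described on the preceding slide can be sketched roughly as follows. This is a minimal illustration, not Luhn's original procedure: the function name, the `max_gap` parameter (set to 4 here, within the 4-5 range quoted earlier), and the choice to return the best-scoring cluster in a sentence are all assumptions.

```python
def significance_factor(words, significant, max_gap=4):
    """Return the highest cluster score found in one sentence.

    words: list of tokens in the sentence.
    significant: set of lowercase words already judged significant.
    max_gap: assumed limit on non-significant words allowed between
             two significant words inside one cluster.
    """
    positions = [i for i, w in enumerate(words) if w.lower() in significant]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1  # still inside the current bracket
        else:
            # Close the bracket: (# significant words)^2 / (# words in cluster)
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1  # open a new bracket
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))
```

For a bracketed span of seven words containing four significant words, this gives 4² / 7 ≈ 2.3.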
Resolving power depends on total # words in article and decreases as total # of words increases
Overcome by running on subdivisions of article; highest-ranking sentences of each are combined to form abstract
Divisions might already exist with paper’s organization
Otherwise, article is divided arbitrarily into overlapping parts
Slide9
Procedures
Abstracts prepared by first punching the text on cards(!)
Pronouns & prepositions deleted via lookup routine
Rest of words sorted alphabetically
Words with common beginnings consolidated (rudimentary form of stemming)
Produced errors up to 5% but did not affect results
Words with low frequency removed, remaining were marked as significant
Sentence signif. then computed with prev. formula
Slide10
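The word-selection steps on the Procedures slide (drop common words, sort, consolidate common beginnings, drop low-frequency words) might look something like the sketch below. The stop list, the fixed `prefix_len` used to merge words, and the `min_count` threshold are hypothetical stand-ins; Luhn's actual lookup table, letter-by-letter comparison, and cutoff are not given in these slides.

```python
from collections import Counter

# Hypothetical stop list; the real lookup table is not given in the slides.
COMMON_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and",
                "he", "she", "it", "they"}

def significant_words(tokens, prefix_len=5, min_count=2):
    """Return the word groups (keyed by shared prefix) that occur often.

    prefix_len and min_count are illustrative assumptions, not Luhn's values.
    """
    kept = [t.lower() for t in tokens if t.lower() not in COMMON_WORDS]
    # Consolidate words with a common beginning by truncating to a prefix.
    counts = Counter(word[:prefix_len] for word in sorted(kept))
    # Drop low-frequency groups; what remains is treated as significant.
    return {stem for stem, n in counts.items() if n >= min_count}
```

With these assumed settings, "police", "policy" and "policing" all fall into the group `polic`, matching the example given earlier in the deck.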
Abstract Creation with Result
Apply cutoff value of sentence significance
Alternatively, require a fixed number of sentences irrespective of document length
Sentences could be weighted by assigning premium value to predetermined set of words if article is of special interest
If no sentences meet threshold, reject article as too general for purpose of auto-abstracting
Slide11
Example
Two major recent developments have called the attention of chemists, physiologists, physicists and other scientists to mental diseases: It has been found that extremely minute quantities of chemicals can induce hallucinations and bizarre psychic disturbances in normal people, and mood-altering drugs (tranquilizers, for instance) have made long-institutionalized people amenable to therapy. (4.0)
This poses new possibilities for studying brain chemistry changes in health and sickness and their alleviation, the California researchers emphasized. (5.4)
The new studies of brain chemistry have provided practical therapeutic results and tremendous encouragement to those who must care for mental patients. (5.4)
Generated Abstract
Slide12
Conclusions
Method proved feasible
Highly reliable, consistent and stable unlike manual creation
Possibility that author’s style causes inferior sentences to be promoted
Method helps to realize savings in human effort
Significant Words → Significant Sentences → Inclusion in Abstract
Slide13
Kupiec’s Objective
Motive: provide intermediate point between document title and full text (i.e. abstract)
Extracts as short as 20% of the original can be as informative as the full text*
Extracts can be non-unique
A combination of numerous methods (including Luhn’s) would have the best performance.
* A.H. Morris, G.M. Kasper, and D.A. Adams. The effects and limitations of automated text condensing on reading comprehension performance. Information Systems Research, pages 17-35, March 1992
Slide14
A Statistical Classification Problem
Have training set of documents w/ manually extracted abstracts
Develop classification function that est. prob. that a given sentence is included in abstract
From this, generate new abstracts by ranking sentences according to this prob. and selecting a user-specified # of top-scoring sentences.
[Diagram: Given Sentences → Feature 1, Feature 2, …, Feature n (each contributes to S’s score) → Determine P() of abstract inclusion using Bayesian Classifier → SCORE → Inclusion Threshold]
Slide15
Evaluation criterion: classification success rate/precision
Requires corpus (expensive)
Acquired from non-profit Engineering Information Co. – used as basis for experiments
All previous methods assume that documents exist in isolation
Slide16
Features
Experimentally Obtained
Sentence Len. Cutoff – short sentences are not usually included in summaries; threshold is 5 words
Fixed-Phrase – sentences containing any of a list of phrases, or following headings such as “Summary”, “Conclusions”, etc., are likely to be in summaries
Paragraph – consider first 10 ¶ and last 5 ¶
Thematic Word – score sentences by the thematic words they contain
Uppercase Word – e.g. proper names, scored similarly to thematic words; the sentence in which an uppercase word first appears scores double relative to later occurrences
Slide17
Classifier
For each sentence s, determine prob. that it will be included in summary S given k features
Since all features are discrete, equation can be put in terms of probs rather than likelihoods.
Results in simple Bayesian classification function that assigns each sentence s a score, used to select sentences for inclusion in summary
Slide18
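The classifier on the preceding slide can be illustrated with the usual naive Bayes form under an independence assumption: P(s in S | F1..Fk) = P(s in S) · Π P(Fj | s in S) / Π P(Fj). The sketch below estimates these probabilities by counting over a training set and then ranks sentences by score. All function names and the data layout are hypothetical, and the add-one smoothing (which assumes two-valued features) is an addition for robustness rather than something stated in the slides.

```python
from collections import Counter

def train(feature_rows, labels):
    """Estimate P(s in S), P(Fj = v | s in S) and P(Fj = v) by counting.

    feature_rows: list of tuples of discrete feature values, one per sentence.
    labels: list of booleans, True if the sentence is in the manual summary.
    """
    n = len(feature_rows)
    k = len(feature_rows[0])
    n_in = sum(labels)
    overall = [Counter() for _ in range(k)]
    in_summary = [Counter() for _ in range(k)]
    for row, label in zip(feature_rows, labels):
        for j, value in enumerate(row):
            overall[j][value] += 1
            if label:
                in_summary[j][value] += 1
    return {"prior": n_in / n, "n": n, "n_in": n_in,
            "overall": overall, "in_summary": in_summary}

def score(model, row):
    """Naive Bayes score for one sentence, with add-one smoothing (an assumption)."""
    s = model["prior"]
    for j, value in enumerate(row):
        p_given_in = (model["in_summary"][j][value] + 1) / (model["n_in"] + 2)
        p_overall = (model["overall"][j][value] + 1) / (model["n"] + 2)
        s *= p_given_in / p_overall
    return s

def summarize(model, rows, top_n):
    """Return indices of the top_n highest-scoring sentences, in document order."""
    ranked = sorted(range(len(rows)),
                    key=lambda i: score(model, rows[i]), reverse=True)
    return sorted(ranked[:top_n])
```

Here `top_n` plays the role of the user-specified number of sentences; a score threshold could be used instead.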
About the Corpus
Articles w/o abstracts; abstracts created manually after the fact
188 document/summary pairs from 21 publications in scientific/technical domains
Summary avg. length is 3 sentences
Slide19
Sentence Matching
Using manually created abstracts, match to sentences in orig. document
Direct match – verbatim or with minor modifications
Direct join – 2 or more sentences used to make summary sentence
Unmatchable – suspected fabrication without using sentences in document
Incomplete – some overlap exists but content is not preserved in summary, or summary sentence includes content from original but contains other information that is not covered by a direct join
[Diagram labels: Abstract, Direct match, Direct join]
Slide20
Evaluation
Insufficient data for separate test corpus, so a cross-validation strategy was used for evaluation
Documents from a journal were selected for testing one at a time; all other document/summary pairs were used for training
Results were summed over journals
Unmatchable/incomplete sentences were excluded from training and testing, leaving 498 unique sentences
Slide21
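One possible reading of the cross-validation scheme on the Evaluation slide is a leave-one-document-out loop whose counts are accumulated per journal. The sketch below assumes that reading; the `(journal, doc_id)` data layout and the `evaluate` callback, which is expected to return a `(correct, total)` pair, are hypothetical.

```python
def leave_one_out_splits(documents):
    """Yield (test_doc, training_docs) for every document in turn.

    documents: list of (journal, doc_id) pairs; this layout is an assumption.
    """
    for i, test_doc in enumerate(documents):
        training = [d for j, d in enumerate(documents) if j != i]
        yield test_doc, training

def summed_results(documents, evaluate):
    """Sum per-document (correct, total) counts within each journal."""
    totals = {}
    for test_doc, training in leave_one_out_splits(documents):
        journal = test_doc[0]
        correct, total = evaluate(training, test_doc)
        c, t = totals.get(journal, (0, 0))
        totals[journal] = (c + correct, t + total)
    return totals
```

Every document is tested exactly once, and the model that scores it never sees its own summary during training.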
Evaluating Performance
Fraction of manual summary sentences that can be reproduced by text excerpting is limited to (451+19)/568 = 83%

Distribution of Correspondence in Training Corpus
                                  # Sentences   Fraction of Corpus
Direct Sentence Matches           451           79%
Direct Joins                      19            3%
Unmatchable Sentences             50            9%
Incomplete Single Sentences       21            4%
Incomplete Joins                  27            5%
Total Manual Summary Sentences    568

Sentence produced is correct if:
Has direct sentence match & present in manual summary
– or –
Is in manual summary as part of direct join and all other components of join have been produced
Slide22
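The two-part correctness rule from the Evaluating Performance slide can be written as a small predicate. This is a sketch only: the function name and the representation of matches and joins as sets of document sentence ids are assumptions made for illustration.

```python
def is_correct(sentence, produced, direct_matches, direct_joins):
    """Apply the two-part correctness rule to one produced sentence.

    produced: set of document sentence ids selected by the summarizer.
    direct_matches: set of document sentence ids with a direct match
                    in the manual summary.
    direct_joins: list of sets; each set holds the document sentence ids
                  joined to form one manual summary sentence.
    """
    if sentence not in produced:
        return False
    if sentence in direct_matches:
        return True
    # A join component counts only if every component of the join was produced.
    return any(sentence in join and join <= produced for join in direct_joins)
```

For example, with `produced = {1, 2, 5}`, `direct_matches = {1}` and `direct_joins = [{2, 3}, {5, 1}]`, sentences 1 and 5 count as correct, while sentence 2 does not because its join partner 3 was not produced.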
Results
Of 568 sentences: 195 direct matches, 6 direct joins
201 correctly ident. summary sentences (35% replication)
Manual summary generation has only 25% overlap between people and 55% for the same person over time.
211/498 (42%) sentences correctly identified by the summarizer
Slide23
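The percentages quoted on the Evaluating Performance and Results slides follow directly from the sentence counts. The short check below recomputes them; the helper name `pct` and the variable names are only for illustration.

```python
def pct(numerator, denominator):
    """Return a ratio as a percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)

# Direct matches + direct joins vs. all manual summary sentences.
upper_bound = pct(451 + 19, 568)
# Correctly identified sentences vs. all manual summary sentences.
replication = pct(195 + 6, 568)
# Correctly identified sentences vs. the 498 matchable sentences.
matchable = pct(211, 498)
```

These evaluate to 83, 35 and 42 respectively, matching the figures on the slides.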
Conclusions
For summaries 25% the size of the document, 84% of the sentences selected were also selected by professionals
For smaller summaries, improvement of 74% observed vs. simply presenting beginning of document.
Slide24
Comparing the Processes
Luhn: Significant Words → Significant Sentences → Inclusion in Abstract
Kupiec: Given Sentences → features (each contributes to S’s score) → Determine P() of abstract inclusion using Bayesian Classifier → SCORE → Inclusion Threshold
Kupiec’s features:
|Sentence| < 5
↑ After fixed phrase
Prior. 1st & last ¶s
↑ Thematic Words
↑ Capitalized, non-unit words
Slide25
References
H.P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Develop., 2:159-165, 1958
Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. In Proc. of the 18th Annual International ACM/SIGIR Conference, pages 68-73, Seattle, WA, 1995