Towards an Extractive Summarization System Using Sentence Vectors and Clustering

John Cadigan, David Ellison, Ethan Roday
System Overview
The document summarization system is organized as a multi-step pipeline.
System Overview
Two major components for content selection:
Feature selection step to generate sentence vectors
Clustering or classification for sentence selection
Overview: Preprocessing
Extract raw text
Split into sentences
Tokenize each sentence
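As a sketch, the three preprocessing steps above might look like the following; the regex-based sentence splitter and tokenizer are illustrative stand-ins, not the system's actual implementation.

```python
import re

def preprocess(raw_text):
    """Split raw text into sentences, then tokenize each sentence."""
    # Naive sentence splitter: break on ., !, or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', raw_text.strip())
    # Naive tokenizer: lowercase and split on non-word characters.
    return [[tok for tok in re.split(r'\W+', s.lower()) if tok]
            for s in sentences]

tokens = preprocess("Pandas eat bamboo. Bamboo is flowering!")
# tokens[0] == ['pandas', 'eat', 'bamboo']
```

A real system would use a trained sentence splitter and tokenizer instead of these regexes, but the pipeline shape is the same.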
Overview: Feature Selection
GloVe vector lookup
Dense word vectors
Average all word vectors to compute the sentence vector
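A minimal sketch of the averaging step, assuming `embeddings` is a word-to-vector lookup (e.g. loaded from pretrained GloVe files); the toy 2-d vectors below are placeholders, not real GloVe dimensions.

```python
def sentence_vector(tokens, embeddings, dim=50):
    """Average the word vectors of all in-vocabulary tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim  # fallback for sentences with no known words
    return [sum(component) / len(vecs) for component in zip(*vecs)]

# Toy 2-d "embeddings" standing in for pretrained GloVe vectors:
emb = {"giant": [1.0, 0.0], "panda": [0.0, 1.0]}
sentence_vector(["giant", "panda", "zzz"], emb, dim=2)  # → [0.5, 0.5]
```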
Overview: Intermediate Representation
Unique ID for each sentence
The sentence ID encodes:
Which document set (e.g. D0901A)
Which document
Which sentence
Example:
D0901A.1.2
Also keep track of:
Sentence embedding (GloVe average)
Original text as it appears in the document
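The intermediate representation could be sketched as a small record plus an ID parser; the `Sentence` class and `parse_id` helper are hypothetical names for illustration, not from the original system.

```python
from dataclasses import dataclass

@dataclass
class Sentence:
    sent_id: str    # e.g. "D0901A.1.2"
    embedding: list # averaged GloVe vector
    text: str       # original text as it appears in the document

def parse_id(sent_id):
    """Split an ID like 'D0901A.1.2' into (docset, doc index, sentence index)."""
    docset, doc, sent = sent_id.split(".")
    return docset, int(doc), int(sent)

s = Sentence("D0901A.1.2", [0.5, 0.5], "Pandas eat bamboo.")
parse_id(s.sent_id)  # → ("D0901A", 1, 2)
```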
Overview: Content Selection
Cluster sentences (vectors) with k-means
Idea: the most “representative” sentence is the one closest to the center of its cluster
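The “closest to center” idea can be sketched as follows, assuming clustering (e.g. k-means) has already grouped the sentence vectors; `closest_to_centroid` is an illustrative helper, not the deck's code.

```python
import math

def closest_to_centroid(cluster):
    """Return the vector in a cluster nearest its centroid (Euclidean)."""
    dim = len(cluster[0])
    centroid = [sum(v[d] for v in cluster) / len(cluster) for d in range(dim)]
    return min(cluster, key=lambda v: math.dist(v, centroid))

cluster = [[0.0, 0.0], [1.0, 1.0], [0.4, 0.6]]
closest_to_centroid(cluster)  # → [0.4, 0.6]
```

In practice the clustering itself would come from an off-the-shelf k-means implementation; only the representative-selection step is shown here.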
Information Ordering
For each sentence, retrieve and order by original document ordering
This information is encoded in each sentence ID
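A sketch of ordering by the indices encoded in the sentence IDs; `order_sentences` is a hypothetical helper and assumes the `D0901A.<doc>.<sent>` ID format described earlier.

```python
def order_sentences(sent_ids):
    """Restore original document order using the (doc, sentence)
    indices encoded in each ID of the form 'D0901A.<doc>.<sent>'."""
    def key(sid):
        _, doc, sent = sid.split(".")
        return (int(doc), int(sent))
    return sorted(sent_ids, key=key)

order_sentences(["D0901A.2.1", "D0901A.1.3", "D0901A.1.1"])
# → ["D0901A.1.1", "D0901A.1.3", "D0901A.2.1"]
```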
Content Realization
Present each centroid sentence “as-is”
We’ve kept the original representation from the input documents
Quantitative results
ROUGE Scores

            P      R      F     Li et al. 2011 (k-means, R)
ROUGE-1   .228   .184   .202   .219
ROUGE-2   .052   .042   .046   .037
ROUGE-3   .016   .013   .014   -
ROUGE-4   .005   .004   .004   -

Analysis
Precision is consistently higher than recall
On par with the k-means implementation of Li et al. (2011)
Qualitative results
Decent example summary
Flowering arrow bamboo is threatening a colony of endangered giant pandas in China but experts have come up with a plan to save them , state media said Monday .
Giant Panda
By the end of 2004 , arrow bamboo , the favorite food of giants , had blossomed on 7,420 hectares at the Baishuijiang …
Bad example summary
“At Littleton , America got a glimpse of the last stop on that train to hell America boarded decades ago when we declared that God is dead and that each of us is his or her own god who can make up the rules as we go along .”
As the day progressed , the snow became thick mud and as many as 100 visitors at a time came to the top of the hill to look at the rough wood cross that had been staked into the ground there .
I think we have to learn from this and not let those kids die in vain .”
Analysis
Bias towards long sentences
Poor prioritization of sentences
Low variation
Discussion: Preprocessing
Lots of bad sentences in current summaries:
Quotes
Non-sentences
Article metadata: New York --LRB-
Stop words contribute significant noise
Lack of term weighting also results in noisy vectors:
tf-idf and LLR are low-effort and high-return
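As an illustration of why term weighting is low-effort, inverse document frequency can be computed in a few lines; `idf_weights` is a hypothetical helper, and a full tf-idf weighting would multiply these values by per-sentence term frequencies before averaging the word vectors.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """Inverse document frequency for each term across the document set."""
    n = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # count each term once per document
    return {t: math.log(n / df[t]) for t in df}

docs = [["panda", "bamboo"], ["panda", "flower"], ["bamboo", "flower"]]
w = idf_weights(docs)
# "panda" appears in 2 of 3 docs: idf = log(3/2) ≈ 0.405
```

Terms in every document (like stop words) get an idf of log(1) = 0, which is exactly the noise-suppression effect described above.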
Discussion: Vector Representation
Vectors are an effective way to represent similarity:
Compare GloVe vectors of verbs, subjects, and objects
Integrate an LDA topic model
NER and co-reference
Other representations may be very effective components:
Probabilistic: KL divergence
Graph-based: LexRank
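A sketch of the probabilistic direction: score a candidate summary by the KL divergence between its unigram distribution and the source's, as in KLSum-style methods (lower is better). All names below are illustrative assumptions.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab, smoothing=1e-9):
    """Smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] + smoothing) / (total + smoothing * len(vocab))
            for w in vocab}

def kl_divergence(p, q):
    """KL(p || q); 0 means the two distributions are identical."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

source = ["panda", "panda", "bamboo", "flower"]
summary = ["panda", "bamboo"]
vocab = set(source) | set(summary)
score = kl_divergence(unigram_dist(summary, vocab),
                      unigram_dist(source, vocab))
# score is in nats; a selection algorithm would greedily add the
# sentence that most reduces it.
```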
Discussion: Selection Methods
Several drawbacks of k-means as selection algorithm:
What is k?
How to prioritize clusters?
How to select from clusters?
Could other representations overcome these flaws?
Conclusion
Three major directions for next steps:
Preprocessing and noise removal
Refining and augmenting vector representations
Experimenting with new selection algorithms