
Towards an Extractive Summarization System Using Sentence Vectors and Clustering

John Cadigan, David Ellison, Ethan Roday

System Overview

The document summarization system is organized as a multi-step pipeline.

System Overview

Two major components for content selection:

Feature selection step to generate sentence vectors

Clustering or classification for sentence selection

Overview: Preprocessing

Extract raw text

Split into sentences

Tokenize each sentence
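The three preprocessing steps can be sketched as follows. The deck does not name specific tools, so the regex splitter and tokenizer below are naive stand-ins for a real sentence splitter and tokenizer (e.g. NLTK's punkt):

```python
import re

def preprocess(raw_text):
    """Extract sentences from raw text, then tokenize each sentence.

    A minimal sketch: splits on whitespace that follows terminal
    punctuation, then breaks each sentence into word and punctuation
    tokens. A production pipeline would use a trained tokenizer.
    """
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]
```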

Overview: Feature Selection

GloVe vector lookup

Dense word vector

Average all word vectors to compute sentence vector
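A sketch of this step, assuming the standard GloVe text-file release format (one `word v1 … vd` line per word); the function names are illustrative, not from the authors' code:

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a standard text release into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def sentence_vector(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; zero vector if none hit."""
    hits = [vectors[t.lower()] for t in tokens if t.lower() in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim, dtype=np.float32)
```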

Overview: Intermediate Representation

Unique ID for each sentence

Sentence ID encodes:

Which document set (e.g. D0901A)

Which document

Which sentence

Example:

D0901A.1.2

Also keep track of:

Sentence embedding (GloVe average)

Original text as it appears in the document
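The intermediate representation above could be captured in a small record type; the field names here are illustrative, only the `set.document.sentence` ID format comes from the deck:

```python
from dataclasses import dataclass, field

@dataclass
class SentenceRecord:
    doc_set: str            # which document set, e.g. "D0901A"
    doc_index: int          # which document within the set
    sent_index: int         # which sentence within the document
    embedding: list = field(default_factory=list)  # averaged GloVe vector
    text: str = ""          # original text as it appears in the document

    @property
    def sent_id(self):
        # IDs like "D0901A.1.2" encode set, document, and sentence position.
        return f"{self.doc_set}.{self.doc_index}.{self.sent_index}"
```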

Overview: Content Selection

Cluster sentences (vectors) with k-means

Idea: the most “representative” sentence is the one closest to the center of its cluster
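A minimal sketch of this selection step. The tiny Lloyd's k-means below is a stand-in for a library implementation such as scikit-learn's `KMeans` (the deck does not say which was used); it returns, per cluster, the index of the vector closest to the cluster center:

```python
import numpy as np

def kmeans_select(X, k, n_iter=50, seed=0):
    """Cluster sentence vectors with k-means and pick one representative
    (the member nearest the center) per cluster. Returns sorted indices."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign every vector to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # "Representative" sentence per cluster: the member closest to the center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    reps = []
    for j in range(k):
        members = np.where(labels == j)[0]
        if len(members):
            reps.append(int(members[dists[members, j].argmin()]))
    return sorted(reps)
```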

Information Ordering

For each sentence, retrieve and order by original document ordering

This information is maintained in each sentence ID
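Because the document and sentence positions are encoded in the ID itself, ordering reduces to parsing and sorting; the helper name below is illustrative:

```python
def order_sentences(sent_ids):
    """Sort sentence IDs like "D0901A.1.2" by document set, then document
    index, then sentence index, i.e. the original document ordering."""
    def key(sid):
        doc_set, doc_idx, sent_idx = sid.split(".")
        return (doc_set, int(doc_idx), int(sent_idx))
    return sorted(sent_ids, key=key)
```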

Content Realization

Present each centroid sentence “as-is”

We’ve kept the original representation from the input documents

Quantitative results

ROUGE Scores

Metric     P      R      F      Li et al. 2011 (k-means R)
ROUGE-1    .228   .184   .202   .219
ROUGE-2    .052   .042   .046   .037
ROUGE-3    .016   .013   .014   -
ROUGE-4    .005   .004   .004   -

Analysis

Precision consistently higher than recall

On par with the k-means implementation of Li et al. (2011)

Qualitative results

Decent example summary

Flowering arrow bamboo is threatening a colony of endangered giant pandas in China but experts have come up with a plan to save them , state media said Monday .

Giant Panda

By the end of 2004 , arrow bamboo , the favorite food of giants , had blossomed on 7,420 hectares at the Baishuijiang

Bad example summary

“At Littleton , America got a glimpse of the last stop on that train to hell America boarded decades ago when we declared that God is dead and that each of us is his or her own god who can make up the rules as we go along .”

As the day progressed , the snow became thick mud and as many as 100 visitors at a time came to the top of the hill to look at the rough wood cross that had been staked into the ground there .

I think we have to learn from this and not let those kids die in vain .”

Analysis

Bias towards long sentences

Poor prioritization of sentences

Low variation

Discussion: Preprocessing

Lots of bad sentences in current summaries:

Quotes

Non-sentences

Article metadata: “New York -LRB-”

Stop words contribute significant noise

Lack of term weighting also results in noisy vectors

tf-idf and LLR are low-effort and high-return
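As a concrete example of the low-effort fix suggested above, word vectors can be weighted by IDF before averaging, so that stop words (high document frequency, near-zero IDF) stop dominating the sentence vector. This sketch computes IDF over sentences rather than a full corpus, which is an assumption for illustration:

```python
import math
from collections import Counter
import numpy as np

def idf_weights(tokenized_sentences):
    """IDF computed over sentences (each sentence treated as a 'document');
    a low-effort stand-in for corpus-level tf-idf weighting."""
    n = len(tokenized_sentences)
    df = Counter()
    for sent in tokenized_sentences:
        df.update({t.lower() for t in sent})
    return {w: math.log(n / df[w]) for w in df}

def weighted_sentence_vector(tokens, vectors, idf, dim):
    """IDF-weighted average of word vectors: rare, contentful words count
    more; words appearing in every sentence contribute almost nothing."""
    num = np.zeros(dim)
    denom = 0.0
    for t in tokens:
        w = t.lower()
        if w in vectors and w in idf:
            num += idf[w] * vectors[w]
            denom += idf[w]
    return num / denom if denom else num
```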

Discussion: Vector Representation

Vectors are an effective way to represent similarity:

Compare GloVe vectors of verbs, subjects, and objects

Integrate an LDA topic model

NER and co-reference

Other representations may be very effective components:

Probabilistic: KL divergence

Graph-based: LexRank
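For the probabilistic direction, a sketch of the core computation: KL-divergence-based summarizers (e.g. KLSum-style systems) greedily pick sentences that minimize KL(summary || document) between unigram distributions. The smoothing scheme below is a simplifying assumption:

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts, smoothing=1e-9):
    """KL(P||Q) between unigram distributions built from token counts.
    Q is lightly smoothed so unseen words don't produce division by zero."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    kl = 0.0
    for w in vocab:
        p = p_counts.get(w, 0) / p_total
        q = (q_counts.get(w, 0) + smoothing) / (q_total + smoothing * len(vocab))
        if p > 0:
            kl += p * math.log(p / q)
    return kl
```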

Discussion: Selection Methods

Several drawbacks of k-means as selection algorithm:

What is k?

How to prioritize clusters?

How to select from clusters?

Could other representations overcome these flaws?
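For the first question, one common heuristic (not one the deck commits to) is the elbow method: sweep k, record the within-cluster sum of squares, and look for the bend in the curve. A sketch, assuming a `kmeans_fn(X, k)` that returns `(labels, centers)`:

```python
import numpy as np

def inertia_curve(X, kmeans_fn, k_values):
    """Record within-cluster sum of squares (inertia) for each candidate k.
    An 'elbow' in this curve is a common heuristic answer to "what is k?"."""
    curve = []
    for k in k_values:
        labels, centers = kmeans_fn(X, k)
        inertia = sum(np.sum((X[labels == j] - centers[j]) ** 2)
                      for j in range(k))
        curve.append((k, float(inertia)))
    return curve
```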

Conclusion

Three major directions for next steps:

Preprocessing and noise removal

Refining and augmenting vector representations

Experimenting with new selection algorithms