Extractive Summarization

Uploaded On 2017-04-20

Presentation Transcript

Slide1

Extractive Summarization

John Cadigan, David Ellison, and Ethan Roday

Slide2

Approach

Preprocessing and data cleanup
Vectorization
K-means
Information ordering with the experts system
CLASSY-style content realization

Raw Input: Docset (XML)
<DOCSET_ID>/*.txt (plain text representation)
Preprocessing: Sentence splitting, Tokenization, “Junk” removal
Sentence Vectorization: Compute tf-idf weighted average GloVe vectors; Compute LDA topic weights
Content Selection: k-means clustering on sentence vectors
Information Ordering: Four-expert panel
Content Realization: CLASSY-style scrubbing, Untokenization
Summaries

Term Weighting: Compute document-level tf-idf
Term weighting: Compute idf scores for all terms in ACQUAINT
ACQUAINT corpus
GloVe vectors
LDA: Compute LDA topic models over ACQUAINT

Slide3

Preprocessing

Slide4

Preprocessing

Simple regex substitutions to remove non-content
Remove things like:
  “ARVADA, Colo. (AP) –” at beginning of article
  With photo.
  By John T. McQuiston
  QUESTIONS OR RERUNS: …
  The late-night supervisor is…

Slide5
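The regex-based junk removal described on the Preprocessing slide might look roughly like the sketch below. The patterns and the `remove_junk` helper are hypothetical, written only to match the examples listed on the slide; the actual substitutions used in the system are not given in the deck.

```python
import re

# Hypothetical patterns modeled on the slide's examples, not the authors' list.
JUNK_PATTERNS = [
    # Dateline at the start of an article, e.g. "ARVADA, Colo. (AP) – "
    re.compile(r"^[A-Z][A-Za-z .,]*\(AP\)\s*(?:--|[-–])\s*"),
    # Whole-line editorial notes and bylines
    re.compile(r"^With photo\.?\s*$", re.IGNORECASE),
    re.compile(r"^By [A-Z][\w.]*(?: [A-Z][\w.]*)*\s*$"),
    re.compile(r"^QUESTIONS OR RERUNS:.*$"),
    re.compile(r"^The late-night supervisor is.*$"),
]


def remove_junk(line: str) -> str:
    """Apply every junk pattern to one line and return what is left."""
    for pattern in JUNK_PATTERNS:
        line = pattern.sub("", line)
    return line.strip()


if __name__ == "__main__":
    print(remove_junk("ARVADA, Colo. (AP) – Students returned to classes."))
    print(repr(remove_junk("By John T. McQuiston")))
```

Lines that come back empty can simply be dropped before sentence splitting. The byline pattern only fires when every word after "By" is capitalized and nothing else follows, so an ordinary sentence such as "By Monday the town had reopened." is left alone.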

Content Selection

Slide6

Vectorization Changes

GloVe vectors are now tf-idf weighted averages
Previously: unweighted averages
tf-idf is computed over the entire ACQUAINT corpus

Slide7
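As a rough illustration of the tf-idf weighted GloVe averaging described under Vectorization Changes, here is a minimal sketch. It assumes raw-count tf and idf(t) = log(N / df(t)); the deck treats the tf and idf schemes as tunable and does not say which were used. The function names and the toy embeddings are hypothetical.

```python
import math
from collections import Counter


def idf_scores(corpus_docs):
    """idf(t) = log(N / df(t)), computed over tokenized background documents."""
    n_docs = len(corpus_docs)
    doc_freq = Counter(term for doc in corpus_docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def weighted_sentence_vector(tokens, doc_tf, idf, embeddings, dim):
    """Average the word vectors of `tokens`, weighting each by tf * idf."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embeddings or tok not in idf:
            continue  # skip tokens with no vector or no idf score
        weight = doc_tf[tok] * idf[tok]
        for i, value in enumerate(embeddings[tok]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing could be weighted
    return [value / weight_sum for value in total]


if __name__ == "__main__":
    toy_embeddings = {"mud": [1.0, 0.0], "flow": [0.0, 1.0]}  # stand-in for GloVe
    toy_idf = {"mud": 2.0, "flow": 1.0}
    doc_tf = Counter(["mud", "flow"])
    print(weighted_sentence_vector(["mud", "flow", "the"], doc_tf, toy_idf, toy_embeddings, 2))
```

Setting every weight to 1 recovers the earlier unweighted average, which makes the two variants easy to compare.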

K-means and information ordering

K-means centroids were not ordered
Tried:
  Most similar to other centroids
  Most contained sentences
In this release, we used information ordering on top sentences with a cutoff totaling 100+ words

Slide8
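The k-means content selection step with a word cutoff could be sketched as follows. This is an assumption-laden illustration, not the authors' code: it uses a plain Lloyd's algorithm, visits clusters largest first (one of the orderings the slide says was tried), takes the sentence nearest each centroid, and stops once the selection reaches `min_words`. The selected sentences would then be handed to the information ordering stage.

```python
import math
import random


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iters):
        assignments = [
            min(range(k), key=lambda c: _dist(v, centroids[c])) for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = _mean(members)
    return centroids, assignments


def select_sentences(sentences, vectors, k, min_words=100):
    """Pick the sentence nearest each centroid until min_words is reached."""
    centroids, assignments = kmeans(vectors, k)
    # Assumed ordering: clusters containing the most sentences come first.
    order = sorted(range(k), key=lambda c: -assignments.count(c))
    chosen, n_words = [], 0
    for c in order:
        members = [i for i, a in enumerate(assignments) if a == c]
        if not members:
            continue
        best = min(members, key=lambda i: _dist(vectors[i], centroids[c]))
        chosen.append(sentences[best])
        n_words += len(sentences[best].split())
        if n_words >= min_words:
            break
    return chosen
```

Because each cluster contributes only its most central sentence, neighboring sentences in the output can come from unrelated clusters, which is one possible source of weak thematic cohesion.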

Content Realization

Slide9

Content Realization

Implemented CLASSY-style cleanup heuristics (from 2006 paper):
  Remove bylines, etc. (this is always done in preprocessing)
  Remove adverbs and a limited list of conjunctions at the beginning of the sentence (BOS)
  Remove ages (“Bill, 50, ate.” → “Bill ate.”)
  Remove relative clause attributions (“Bill, who already ate, ate again.” → “Bill ate again.”)
  Remove attributions, as long as it isn’t a direct quotation (“Bill said he already ate.” → “He already ate.”)
Untokenize sentences before presentation of summaries:
  did n’t → didn’t, …
  untokenize punctuation

Slide10
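A few of the CLASSY-style cleanup heuristics and the untokenization step from the Content Realization slide can be approximated with regular expressions. The sketch below is illustrative only: the patterns, the `LEADING_WORDS` list, and both function names are assumptions, and the rules from the 2006 CLASSY paper are more careful than this.

```python
import re

# Hypothetical, deliberately short list of sentence-initial words to strip.
LEADING_WORDS = ("But", "And", "However", "Meanwhile", "Also")


def trim_sentence(sentence: str) -> str:
    """Apply simplified versions of the trimming heuristics to one sentence."""
    s = sentence
    # Ages: "Bill, 50, ate." -> "Bill ate."
    s = re.sub(r"\s*,\s+\d{1,3}\s*,\s+", " ", s)
    # Relative-clause attributions: "Bill, who already ate, ate again."
    s = re.sub(r"\s*,\s+who\b[^,]*,\s+", " ", s)
    # Attributions, skipped when the sentence contains quotation marks.
    if not re.search(r"[\"“”`]|''", s):
        s = re.sub(r"^[A-Z][\w.]*(?: [A-Z][\w.]*)* said (?:that )?", "", s)
    # Sentence-initial adverbs and conjunctions from the short list above.
    s = re.sub(r"^(?:%s),?\s+" % "|".join(LEADING_WORDS), "", s)
    # Restore the capital letter if a removal exposed a lowercase word.
    return s[:1].upper() + s[1:]


def untokenize(tokens):
    """Join tokens, reattaching clitics and closing punctuation."""
    text = " ".join(tokens)
    # "did n't" -> "didn't", "it 's" -> "it's"
    text = re.sub(r" (n['’]t|['’](?:s|re|ve|ll|d|m))\b", r"\1", text)
    # "ate ." -> "ate."
    text = re.sub(r" ([.,;:!?])", r"\1", text)
    return text


if __name__ == "__main__":
    print(trim_sentence("Bill, who already ate, ate again."))
    print(untokenize("He did n't eat .".split()))
```

Rules this blunt are easy to over-apply: an attribution pattern that ignores sentence structure can mangle quotes, and comma-delimited removals can leave stray punctuation behind.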

CLASSY 2006 Sentence Trimming configurations

Slide11

Content Realization: some issues

Possible quote mangling
Oddly placed commas
Too-aggressive adverb removal?

Examples:
  “Physically, it’s the same town it was Monday.”
  “…the Guinean capital of Conakry was unexpectedly closed Monday…”
  District Attorney Robert Johnson plans to meet with the Diallos shortly before 2 p.m. , when the grand jury indictments are scheduled to be unsealed in open court .
  Police officers have rarely been convicted for killings that occurred while they were on duty.
  How quickly did he fall?

Slide12

Quantitative Results

Slide13

Quantitative Results

D4: Devtest

Metric    Precision  Recall  F-Score
ROUGE-1   0.241      0.212   0.225
ROUGE-2   0.052      0.046   0.049

D4: Evaltest

Metric    Precision  Recall  F-Score
ROUGE-1   0.260      0.239   0.248
ROUGE-2   0.059      0.055   0.057

Slide14
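For readers unfamiliar with the metrics in the Quantitative Results tables, ROUGE-N precision, recall and F-score are based on clipped n-gram overlap between a system summary and a reference. The sketch below is a simplified single-reference version with a hypothetical `rouge_n` helper; the official ROUGE toolkit also handles multiple references and other options, so it will not reproduce the reported numbers.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f_score) for one candidate/reference pair."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection clips each n-gram to its smaller count.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    system = "the mud flow forced people to flee".split()
    model = "mud flow forced thousands to flee their homes".split()
    print(rouge_n(system, model, n=1))
    print(rouge_n(system, model, n=2))
```

ROUGE-2 is much lower than ROUGE-1 in the tables because bigram matches require two adjacent words to agree with the reference, which extractive sentences from different source documents rarely do.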

Game of Qualitative Results

Best, worst and mediocre

Slide15

MEDIOCRE: ROUGE-1: 0.16461 ROUGE-2: 0.04603

But for now, for the next several weeks, people seem able only to get through the worst of it, to handle the realization that some people are not coming back and that yes, things like this do happen here.

Students returned to classes Thursday at Chatfield High School, but the bloodbath at rival Columbine High haunted the halls.

Investigators, spending the day at the memorial service, were to resume their work this morning, conducting more interviews and eyeing the possibility of additional suspects in Tuesday's massacre.

Team members decided they wanted to play out the rest of the season.

Really long, non-specific first sentence
Variation of themes

Slide16

WORST (D1030): ROUGE-1: 0.09921 ROUGE-2: 0.01210

the current regulations have created a quagmire of consumer confusion and set up potential health crises that even industry officials say could hurt producers as well as users of herbal products.

`The main thing you want is someone who knows enough to keep you out of trouble,'' said Dr. John B. Neeld Jr., president of the American Society of Anesthesiologists.

While over-the-counter drugs are subject to Food and Drug Administration regulation, herbal supplements are assumed safe unless proved otherwise.

If the products were safe, companies could say what they wished, so long as they did not claim their products could prevent, treat or cure disease.

News-speak (not newspeak)
  “quagmire”
  “hurt producers”
Not that bad
  It’s about FDA regulations

Slide17

Best: ROUGE-1: 0.41803 ROUGE-2: 0.17917

An Indonesian minister, Aburizal Bakrie, claimed last month the flow was a ``natural disaster'' unrelated to the drilling activities of a company, Lapindo Brantas Inc, which belongs to a group controlled by his family.

President Susilo Bambang Yudhoyono has ordered Lapindo to pay 3.8 trillion rupiah -LRB- 420 million dollars -RRB- in compensation and costs related to the mud flow.

A gas well near Surabaya in East Java has spewed steaming mud since May last year, submerging villages, factories and fields and forcing more than 15,000 people to flee their homes.

All the themes:
  Money
  Disaster
  Government
Could improve ordering

Slide18

Discussion

Slide19

Discussion

The good:
  Content being selected is mostly relevant
  Topicality has improved over time
The bad:
  Lack of thematic cohesion seems to predominate
  Possibly a drawback of k-means

Slide20

Discussion

Parameter tuning matters:
  tf scheme, idf scheme, GloVe weight, LDA weight, k
Worst and best devtest:

Devtest best

Metric    Precision  Recall  F-Score
ROUGE-1   0.241      0.212   0.225
ROUGE-2   0.052      0.046   0.049

Devtest worst

Metric    Precision  Recall  F-Score
ROUGE-1   0.183      0.149   0.163
ROUGE-2   0.035      0.028   0.031
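One way to explore the parameters listed in the Discussion (tf scheme, idf scheme, GloVe weight, LDA weight, k) is an exhaustive grid search that records the best and worst configurations. The sketch below is only a guess at how such tuning could be organized: the value grids are made up, and `score_fn` is a hypothetical callback that would run the pipeline on devtest and return, say, a ROUGE-2 F-score.

```python
import itertools

# Hypothetical value grids; the deck does not list the values that were tried.
GRID = {
    "tf_scheme": ["raw", "log"],
    "idf_scheme": ["standard", "smooth"],
    "glove_weight": [0.5, 1.0],
    "lda_weight": [0.0, 0.5],
    "k": [5, 10],
}


def grid_search(score_fn, grid=GRID):
    """Score every configuration; return (best, worst) as (score, config) pairs."""
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        results.append((score_fn(config), config))
    best = max(results, key=lambda item: item[0])
    worst = min(results, key=lambda item: item[0])
    return best, worst


if __name__ == "__main__":
    # Toy scoring function standing in for a real devtest evaluation.
    def toy_score(config):
        return config["k"] * 0.01 + config["glove_weight"] - config["lda_weight"]

    best, worst = grid_search(toy_score)
    print("best:", best)
    print("worst:", worst)
```

With five parameters the grid grows multiplicatively, so in practice each grid would need to stay small or be replaced by random sampling.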