

Presentation Transcript


Get To The Point: Summarization with Pointer-Generator Networks

Abigail See, Peter J. Liu, Christopher D. Manning

Presented by: Matan Eyal

Agenda

Introduction
Word Embeddings
RNNs
Sequence-to-Sequence
Attention

Pointer Networks

Coverage Mechanism

Introduction

Abstractive Summarization
Input: a long document
Output: a short, concise abstractive summary.

While this lecture is indeed about abstractive summarization, we will also talk about machine translation; this will be helpful for understanding the different architectures.

Word embeddings

Deep Learning Architectures are incapable of processing strings or plain text in their raw form.

They require numbers as inputs to perform any sort of job.

How can we encode words?

Word embeddings

One way of encoding words is one-hot encoding.
Problems? Size? Orthogonality? Etc.
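A minimal sketch of both problems, using a toy vocabulary (the words and sizes are just for illustration):

```python
import numpy as np

vocab = ["summarize", "summary", "translate", "banana"]  # toy vocabulary

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's index and 0s everywhere else."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Size: the vector is as long as the vocabulary (easily 100k+ words in practice).
print(one_hot("summary", vocab))                                # [0. 1. 0. 0.]

# Orthogonality: every pair of distinct words has dot product 0, so "summarize"
# is no more similar to "summary" than it is to "banana".
print(one_hot("summarize", vocab) @ one_hot("summary", vocab))  # 0.0
```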

Word embeddings

Vector space models (VSMs) represent (embed) words in a continuous vector space.
Semantically similar words are mapped to nearby points (they are embedded near one another).
Words that appear in the same contexts share semantic meaning.
Count-based methods
Predictive models

Word embeddings – word2vec

A computationally efficient predictive model for learning word embeddings from raw text.
Uses the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model.

CBOW predicts target words from source context words.

Skip-Gram predicts source context words from the target words.

Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013)
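As a rough sketch of the Skip-Gram setup (not from the slides), here is how (target, context) training pairs can be extracted from raw text; the sentence and window size are assumptions for illustration:

```python
sentence = "the model learns word embeddings from raw text".split()
window = 2  # context window size (illustrative)

# Skip-Gram predicts the context from the target, so each training pair is
# (target, context); CBOW would instead predict the target from its contexts.
pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:4])
# [('the', 'model'), ('the', 'learns'), ('model', 'the'), ('model', 'learns')]
```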

Neural Networks

(While this is far from a good example) Given a word, we want to determine whether it is a girl's or a boy's name.

Input is a word

Output is 1 iff it is a girl's name.

How will we represent a word? Word Embeddings!

I understand Word Embeddings perfectly fine; what do they have to do with summarization?

Tomas Mikolov

Recurrent Neural Networks

Assume our goal is to find the sentiment of the input.
Input is a sentence (a list of words).
Output is the most likely sentiment.

What is the problem?

Language is sequential!

Solution?

Recurrent Neural Networks (RNNs)

Recurrent Neural Networks

Input vector x, output vector y.
But(!) the entire history of inputs you have fed in is also carried between cells.
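A minimal sketch of that idea with a vanilla RNN cell (illustrative sizes, not the LSTM/GRU cells discussed next): the hidden state h is what carries the history from cell to cell.

```python
import numpy as np

d_in, d_hid = 4, 8                       # illustrative sizes
W_xh = np.random.randn(d_hid, d_in) * 0.1
W_hh = np.random.randn(d_hid, d_hid) * 0.1
b_h = np.zeros(d_hid)

def rnn_step(x_t, h_prev):
    """One time step: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

sentence = [np.random.randn(d_in) for _ in range(5)]   # five word embeddings
h = np.zeros(d_hid)
for x_t in sentence:
    h = rnn_step(x_t, h)   # h now summarizes the whole prefix of the sentence
# For the sentiment task, the final h would feed a small classifier.
```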

Recurrent Neural Networks

What about our sentiment analysis task?

Recurrent Neural Networks

LSTMs – Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997)
GRUs – Gated Recurrent Units (Cho et al., 2014)
Illustrations from http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Jürgen Schmidhuber

Oh, awesome! What can I do with this?

Neural Machine Translation

Probabilistic perspective: finding a target sentence y that maximizes the conditional probability of y given a source sentence x.
That is, arg max_y p(y | x).
In neural machine translation, we fit a parameterized model to maximize this probability using a parallel training corpus.
Which parameterized model should we fit?
Sequence-to-Sequence!

Sequence-to-Sequence: The Encoder

Reads the input sentence x = (x_1, ..., x_{T_x}) into a vector c:
h_t = f(x_t, h_{t-1}),   c = q({h_1, ..., h_{T_x}})
f and q are some nonlinear functions.
Originally, f is an LSTM, while q({h_1, ..., h_{T_x}}) = h_{T_x}.
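A minimal sketch of the encoder under those definitions (not the paper's code), with a plain RNN standing in for the LSTM and c taken to be the last hidden state:

```python
import numpy as np

d_in, d_hid = 4, 8                       # illustrative sizes
W_xh = np.random.randn(d_hid, d_in) * 0.1
W_hh = np.random.randn(d_hid, d_hid) * 0.1

def encode(source):
    """source: list of word embeddings x_1..x_T. Returns all h_t and c."""
    h = np.zeros(d_hid)
    states = []
    for x_t in source:
        h = np.tanh(W_xh @ x_t + W_hh @ h)   # h_t = f(x_t, h_{t-1})
        states.append(h)
    c = states[-1]                           # q({h_1..h_T}) = h_T in the original setup
    return states, c

states, c = encode([np.random.randn(d_in) for _ in range(6)])
```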

Sequence-to-Sequence: The Decoder

Trained to predict the next word y_t given:
The context vector c (remember we chose c = h_{T_x})
All the previously predicted (or ground-truth) words y_1, ..., y_{t-1}
That is, we want to maximize:
p(y) = Π_t p(y_t | {y_1, ..., y_{t-1}}, c)
How can we model each conditional probability?
RNN!
p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c), where s_t is the hidden state of the decoder at time t.
g produces a probability distribution over the vocabulary using a softmax over two linear layers applied to s_t and c.
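A minimal sketch (not the paper's code) of one decoder step following the "softmax over two linear layers" description above; the weight names and sizes are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_hid, d_ctx, vocab_size = 8, 8, 50                 # illustrative sizes
V1 = np.random.randn(d_hid, d_hid + d_ctx) * 0.1    # first linear layer
b1 = np.zeros(d_hid)
V2 = np.random.randn(vocab_size, d_hid) * 0.1       # second linear layer
b2 = np.zeros(vocab_size)

def decoder_step(s_t, c):
    """Map the decoder state s_t and context vector c to a distribution over the vocabulary."""
    hidden = V1 @ np.concatenate([s_t, c]) + b1
    return softmax(V2 @ hidden + b2)

p_vocab = decoder_step(np.random.randn(d_hid), np.random.randn(d_ctx))
next_word_id = int(np.argmax(p_vocab))              # greedy choice of y_t
```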

The concept of attention is the most interesting recent architectural innovation in neural networks. (Andrej Karpathy)

Learning to Align and Translate: The Encoder

Bidirectional recurrent neural networks (Schuster and Paliwal, 1997)
Now the annotation for word x_j is the concatenation of the forward and backward hidden states:
h_j = [→h_j ; ←h_j]
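A tiny sketch of the concatenated annotations, with random vectors standing in for the outputs of the forward and backward RNNs (illustrative sizes):

```python
import numpy as np

T, d_hid = 5, 8
forward_states = [np.random.randn(d_hid) for _ in range(T)]    # left-to-right RNN
backward_states = [np.random.randn(d_hid) for _ in range(T)]   # right-to-left RNN

# h_j = [forward h_j ; backward h_j]: each annotation summarizes the whole
# sentence with a focus on the words around position j.
annotations = [np.concatenate([f, b]) for f, b in zip(forward_states, backward_states)]
assert annotations[0].shape == (2 * d_hid,)
```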

Learning to Align and Translate: The Decoder

In the original architecture, p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c).
Now, p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i).
A distinct context vector c_i for each target word y_i.

Learning to Align and Translate: The Decoder

c_i = Σ_j α_ij h_j   (Context vector)
α_ij = exp(e_ij) / Σ_k exp(e_ik)   (Weights)
e_ij = a(s_{i-1}, h_j)   (Alignment model)
This scores how well the inputs around position j and the output at position i match.
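A minimal sketch of these three equations (not the paper's code), using an additive (tanh) alignment model; the weight names W_a, U_a, v_a and all sizes are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_dec, d_ann, d_att, T = 8, 16, 10, 5
W_a = np.random.randn(d_att, d_dec) * 0.1
U_a = np.random.randn(d_att, d_ann) * 0.1
v_a = np.random.randn(d_att) * 0.1

s_prev = np.random.randn(d_dec)                  # previous decoder state s_{i-1}
H = [np.random.randn(d_ann) for _ in range(T)]   # encoder annotations h_1..h_T

e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # alignment scores e_ij
alpha = softmax(e)                               # attention weights α_ij
c_i = sum(a * h_j for a, h_j in zip(alpha, H))   # context vector c_i
```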

Learning to Align and Translate

α_ij reflects the importance of the annotation h_j with respect to the previous hidden state s_{i-1} in deciding the next state s_i and generating y_i.
By letting the decoder have an attention mechanism, we relieve the encoder of the burden of having to encode all the information in the source sentence into a fixed-length vector.

Learning to Align and Translate

Experiment Settings

ACL WMT '14 English-French parallel corpora
Minibatch stochastic gradient descent (minibatches of 80 sentences)
Adadelta (Zeiler, 2012) was used to automatically adapt the learning rates.
Training took approximately 5 days
Beam search at test time

Results

BLEU scores of the trained models, computed on the test set.

Alexander Rush

Summarization is also a mapping from an input sequence to a (shorter) output sequence.

Abstractive Summarization

Problems?
Out-of-vocabulary (OOV) words

Pointer-Generator Network

How to deal with OOV words?

Learn a generation probability p_gen ∈ [0, 1]:
p_gen = σ(w_{h*}^T h_t* + w_s^T s_t + w_x^T x_t + b_ptr)
p_gen is used as a soft switch to choose between generating a word from the vocabulary by sampling from P_vocab, or copying a word from the input sequence by sampling from the attention distribution a^t.
Sample from the extended vocabulary distribution:
P(w) = p_gen · P_vocab(w) + (1 − p_gen) · Σ_{i: w_i = w} a_i^t

Vinyals et al., 2015
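A minimal sketch (not the authors' code) of the extended-vocabulary distribution; the toy vocabulary, source sentence, and probability values are made up for illustration, with "kakutani" standing in for an out-of-vocabulary name:

```python
import numpy as np

vocab = ["the", "reactor", "shut", "down", "[UNK]"]
source_words = ["the", "kakutani", "reactor", "shut", "down"]   # "kakutani" is OOV

p_gen = 0.3                                          # output of the sigmoid soft switch
p_vocab = np.array([0.4, 0.3, 0.2, 0.05, 0.05])      # P_vocab over the fixed vocabulary
attention = np.array([0.05, 0.7, 0.1, 0.1, 0.05])    # a^t over the source positions

# Extended vocabulary = fixed vocabulary + OOV words appearing in the source.
extended = vocab + [w for w in source_words if w not in vocab]
p_final = np.zeros(len(extended))
p_final[:len(vocab)] = p_gen * p_vocab               # generation part
for i, w in enumerate(source_words):                 # copy part: add attention mass
    p_final[extended.index(w)] += (1 - p_gen) * attention[i]

print(extended[int(np.argmax(p_final))])             # "kakutani": an OOV word copied from the source
```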

Pointer-Generator Network

Coverage Mechanism

Repetition is a common problem for sequence-to-sequence models

Maintain a coverage vector c^t = Σ_{t'=0}^{t-1} a^{t'}
This is the (unnormalized) distribution over the source document words accumulated so far.
In order to ensure the attention mechanism is informed of its previous decisions, add this to the alignment model:
e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
Loss function: covloss_t = Σ_i min(a_i^t, c_i^t)

Tu et al., 2016
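A minimal sketch (not the authors' code) of the coverage vector update and coverage loss, with random attention distributions standing in for the model's:

```python
import numpy as np

T_src = 6                                      # number of source tokens (illustrative)
coverage = np.zeros(T_src)                     # c^t: sum of attention distributions so far
total_cov_loss = 0.0

def coverage_loss(a_t, c_t):
    """covloss_t = sum_i min(a_i^t, c_i^t): penalizes re-attending to covered words."""
    return np.minimum(a_t, c_t).sum()

for t in range(3):                             # three decoder steps
    a_t = np.random.dirichlet(np.ones(T_src))  # stand-in for the attention distribution a^t
    total_cov_loss += coverage_loss(a_t, coverage)   # added (reweighted) to the main loss
    coverage += a_t                            # c^{t+1} = c^t + a^t

print(round(coverage.sum(), 6))                # coverage mass grows by 1 per step -> 3.0
```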

Results

ROUGE scores of the trained models on the CNN/Daily Mail dataset, computed on the test set.