Slide1
Get To The Point: Summarization with Pointer-Generator Networks
Abigail See, Peter J. Liu, Christopher D. Manning
Presented by: Matan Eyal
Slide2
Agenda
Introduction
Word Embeddings
RNNs
Sequence-to-Sequence
Attention
Pointer Networks
Coverage Mechanism
Slide3
Introduction
Abstractive Summarization
Input: a long document
Output: a short, concise abstractive summary
While this lecture is indeed about abstractive summarization, we will also talk about machine translation; this will be helpful for understanding the different architectures.
Slide4
Word embeddings
Deep Learning Architectures are incapable of processing strings or plain text in their raw form.
They require numbers as inputs to perform any sort of job.
How can we encode words?
Slide5
Word embeddings
One way of encoding words is one-hot encoding.
Problems? Size (vectors are as long as the vocabulary)? Orthogonality (no two words are ever similar)? Etc.
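To make the size and orthogonality problems concrete, here is a minimal sketch (the toy vocabulary and helper function are illustrative, not from the slides):

```python
import numpy as np

vocab = ["summarize", "document", "network", "attention"]  # toy vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector: as long as the vocabulary, with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Size problem: with a realistic vocabulary (~50k words) every vector has 50k entries.
# Orthogonality problem: any two different words have dot product 0,
# so one-hot vectors carry no notion of similarity.
print(one_hot("document") @ one_hot("attention"))  # -> 0.0
```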
Slide6
Word embeddings
Vector space models (VSMs) represent (embed) words in a continuous vector space.
Semantically similar words are mapped to nearby points (are embedded near each other).
Words that appear in the same contexts share semantic meaning.
Count-based methods
Predictive models
Slide7
Word embeddings – word2vec
A computationally efficient predictive model for learning word embeddings from raw text.
Uses the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model.
CBOW predicts target words from source context words.
Skip-Gram predicts source context words from the target words.
Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013)
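A minimal Skip-Gram sketch in PyTorch (the toy vocabulary size, embedding dimension, and training pairs are assumptions for illustration; word2vec's efficiency tricks such as negative sampling are omitted):

```python
import torch
import torch.nn as nn

# Minimal Skip-Gram sketch: the target word predicts each of its context words
# via a softmax over the vocabulary. Toy sizes, not a real training setup.
vocab_size, emb_dim = 10, 8
in_embed = nn.Embedding(vocab_size, emb_dim)   # target-word embeddings (the learned word vectors)
out_layer = nn.Linear(emb_dim, vocab_size)     # scores for every candidate context word

# (target, context) word-id pairs extracted from a sliding window over a corpus
pairs = torch.tensor([[2, 3], [2, 1], [5, 4]])
targets, contexts = pairs[:, 0], pairs[:, 1]

logits = out_layer(in_embed(targets))                  # (batch, vocab_size)
loss = nn.functional.cross_entropy(logits, contexts)   # maximize p(context | target)
loss.backward()                                        # gradients update the embeddings
```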
Slide8
Neural Networks
(While this is far from a good example) Given a word, we want to find out whether it is a girl's name or a boy's name.
Input is a word.
Output is 1 iff it is a girl's name.
How will we represent a word? Word Embeddings!
Slide9
I understand Word Embeddings perfectly fine; what do they have to do with summarization?
Tomas Mikolov
Slide10
Recurrent Neural Networks
Assume our goal is to find the sentiment of the input.
Input is a sentence (a list of words).
Output is the most likely sentiment.
What is the problem?
Language is sequential!
Solution?
Recurrent Neural Networks (RNNs)
Slide11
Recurrent Neural Networks
Input vector x, output vector y. But(!) the entire history of inputs you have fed in is also passed between cells.
Slide12
Recurrent Neural Networks
What about our sentiment analysis task?
Slide13
Recurrent Neural Networks
LSTMs – Long Short-Term Memory networks (Hochreiter and Schmidhuber, 1997)
GRUs – Gated Recurrent Units (Cho et al., 2014)
Illustrations from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
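A minimal sketch of the sentiment setup with an LSTM (toy dimensions and layer names are mine, not the slides' model): the recurrent cell is unrolled over the sentence, and the final hidden state, which has seen the whole input history, feeds a classifier.

```python
import torch
import torch.nn as nn

# Minimal RNN-for-sentiment sketch (toy dimensions, illustrative only).
vocab_size, emb_dim, hidden_dim, num_classes = 100, 16, 32, 2

embed = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # an LSTM cell unrolled over the sentence
classify = nn.Linear(hidden_dim, num_classes)

sentence = torch.tensor([[4, 17, 3, 52]])               # one sentence as word ids, shape (1, seq_len)
outputs, (h_n, c_n) = lstm(embed(sentence))             # h_n carries the whole input history
sentiment_logits = classify(h_n[-1])                    # classify from the final hidden state
print(sentiment_logits.shape)                           # -> torch.Size([1, 2])
```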
Slide14
Jürgen Schmidhuber
Oh, awesome! What can I do with this?
Slide15
Neural Machine Translation
Probabilistic perspective: finding a target sentence y that maximizes the conditional probability of y given a source sentence x.
That is, \arg\max_y p(y \mid x).
In neural machine translation, we fit a parameterized model to maximize this conditional probability using a parallel training corpus.
Which parameterized model should we fit?
Sequence-to-Sequence!
Slide16
Sequence-to-Sequence
The Encoder
Reads the input sentence x = (x_1, \dots, x_{T_x}) into a vector c.
h_t = f(x_t, h_{t-1}) and c = q(\{h_1, \dots, h_{T_x}\}), where f and q are some nonlinear functions.
Originally, f is an LSTM, while q(\{h_1, \dots, h_T\}) = h_T.
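A minimal encoder sketch of these equations, assuming an LSTM for f and q({h_1, ..., h_T}) = h_T as on the slide (dimensions are toy values of my own):

```python
import torch
import torch.nn as nn

# Encoder sketch: h_t = f(x_t, h_{t-1}) with f an LSTM, and c = q({h_1..h_T}) = h_T.
src_vocab, emb_dim, hidden_dim = 100, 16, 32

embed = nn.Embedding(src_vocab, emb_dim)
encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

source = torch.tensor([[5, 8, 23, 7, 2]])          # source sentence word ids, shape (1, T_x)
states, (h_T, _) = encoder(embed(source))          # states: all h_1..h_T; h_T: last hidden state
c = h_T[-1]                                        # fixed-length context vector c = h_T
print(c.shape)                                     # -> torch.Size([1, 32])
```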
Slide17
Sequence-to-Sequence
The Decoder
Trained to predict the next word y_t given:
The context vector c (remember we chose c = h_T)
All the previously predicted (or ground-truth) words y_1, \dots, y_{t-1}
That is, we want to maximize:
p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)
How can we model each conditional probability? An RNN!
p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c), where s_t is the hidden state of the decoder at time t.
g produces a probability distribution over the vocabulary using a softmax over two linear layers.
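A minimal sketch of one decoder step (toy dimensions; feeding c by concatenating it with the previous word embedding, and using a single output projection instead of the slide's two linear layers, are simplifications of my own):

```python
import torch
import torch.nn as nn

# One decoder step sketch: p(y_t | y_<t, c) = softmax(g(y_{t-1}, s_t, c)).
tgt_vocab, emb_dim, hidden_dim = 120, 16, 32

embed = nn.Embedding(tgt_vocab, emb_dim)
decoder_cell = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)
project = nn.Linear(hidden_dim, tgt_vocab)

c = torch.zeros(1, hidden_dim)                     # context vector from the encoder
s_t, mem = torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim)
y_prev = torch.tensor([1])                         # previously predicted (or ground-truth) word

s_t, mem = decoder_cell(torch.cat([embed(y_prev), c], dim=-1), (s_t, mem))
p_vocab = torch.softmax(project(s_t), dim=-1)      # distribution over the target vocabulary
print(p_vocab.shape)                               # -> torch.Size([1, 120])
```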
Slide18
The concept of attention is the most interesting recent architectural innovation in neural networks. (Andrej Karpathy)
Slide19
Learning to Align and Translate
The Encoder
Bidirectional recurrent neural networks (Schuster, M. and Paliwal, K. K., 1997)
Now the annotation for word x_j is h_j = [\overrightarrow{h}_j ; \overleftarrow{h}_j], the concatenation of the forward and backward hidden states.
Slide20
Learning to Align and Translate
The Decoder
In the original architecture, p(y_t \mid \{y_1, \dots, y_{t-1}\}, x) = g(y_{t-1}, s_t, c).
Now, p(y_i \mid \{y_1, \dots, y_{i-1}\}, x) = g(y_{i-1}, s_i, c_i).
A distinct context vector c_i for each target word y_i.
Slide21
Learning to Align and Translate
The Decoder
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j   (context vector)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}   (weights)
e_{ij} = a(s_{i-1}, h_j)   (alignment model)
This scores how well the inputs around position j and the output at position i match.
Slide22
Learning to Align and Translate
\alpha_{ij} reflects the importance of the annotation h_j with respect to the previous hidden state s_{i-1} in deciding the next state s_i and generating y_i.
By letting the decoder have an attention mechanism, we relieve the encoder of the burden of having to encode all information in the source sentence into a fixed-length vector.
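A minimal sketch of the attention computation, assuming the additive (MLP) alignment model of Bahdanau et al.; tensor shapes and layer names are toy values of my own:

```python
import torch
import torch.nn as nn

# Attention sketch: e_ij = a(s_{i-1}, h_j), alpha = softmax(e), c_i = sum_j alpha_ij h_j.
hidden_dim, T_x = 32, 5

W_s = nn.Linear(hidden_dim, hidden_dim, bias=False)
W_h = nn.Linear(2 * hidden_dim, hidden_dim, bias=False)  # annotations are bidirectional, so 2x size
v = nn.Linear(hidden_dim, 1, bias=False)

s_prev = torch.zeros(1, hidden_dim)                 # previous decoder state s_{i-1}
annotations = torch.randn(T_x, 2 * hidden_dim)      # h_j = [forward; backward] for each source word

e = v(torch.tanh(W_s(s_prev) + W_h(annotations))).squeeze(-1)   # alignment scores e_ij, shape (T_x,)
alpha = torch.softmax(e, dim=-1)                    # attention weights over the source words
c_i = alpha @ annotations                           # context vector c_i, distinct per output step
print(alpha.sum().item(), c_i.shape)                # -> 1.0 torch.Size([64])
```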
Slide23
Learning to Align and Translate
Slide24
Experiment Settings
ACL WMT ’14 English-French parallel corpora
Minibatch stochastic gradient descent (minibatches of 80 sentences)
Adadelta (Zeiler, 2012) was used to automatically adapt the learning rates.
Training took approximately 5 days
Beam search at test time
Slide25
Results
BLEU scores of the trained models, computed on the test set.
Slide26
Alexander Rush
Summarization is also a mapping from an input sequence to a (shorter) output sequence.
Slide27
Abstractive Summarization
Problems? Out-of-vocabulary (OOV) words
Slide28
Pointer-Generator Network
How do we deal with OOV words?
Learn a generation probability p_gen = \sigma(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr}), which is used as a soft switch to choose between generating a word from the vocabulary by sampling from P_vocab, or copying a word from the input sequence by sampling from the attention distribution a^t.
Sample from the extended vocabulary distribution:
P(w) = p_gen P_vocab(w) + (1 - p_gen) \sum_{i : w_i = w} a_i^t
Vinyals et al., 2015
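A minimal sketch of the extended-vocabulary mixture (the tensors standing in for P_vocab, the attention distribution, and p_gen are random placeholders; in the model, p_gen is computed from the context vector, decoder state, and decoder input as above):

```python
import torch

# Pointer-generator sketch: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i.
vocab_size, num_oov, src_len = 10, 2, 4
p_vocab = torch.softmax(torch.randn(vocab_size), dim=-1)      # generator distribution over the vocabulary
p_vocab = torch.cat([p_vocab, torch.zeros(num_oov)])          # generator assigns 0 to source-only OOV words
attention = torch.softmax(torch.randn(src_len), dim=-1)       # attention a^t over the source words
source_ids = torch.tensor([3, 10, 11, 3])                     # source words mapped into the extended vocab

p_gen = torch.sigmoid(torch.randn(()))                        # soft switch: generate vs. copy
final_dist = p_gen * p_vocab
final_dist = final_dist.scatter_add(0, source_ids, (1 - p_gen) * attention)  # add copy probability mass
print(final_dist.sum())                                       # -> ~1.0, a valid extended-vocab distribution
```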
Slide29
Pointer-Generator Network
Slide30
Coverage Mechanism
Repetition is a common problem for sequence-to-sequence models.
Maintain a coverage vector c^t = \sum_{t'=0}^{t-1} a^{t'}: the (unnormalized) distribution over the source document words, recording how much attention each word has received so far.
In order to ensure the attention mechanism is informed of its previous decisions, add this to the alignment model:
e_i^t = v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})
Loss function: covloss_t = \sum_i \min(a_i^t, c_i^t)
Tu et al., 2016
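A minimal sketch of the coverage vector and coverage loss (the attention distributions here are random stand-ins; in the model they would come from the coverage-aware alignment model above):

```python
import torch

# Coverage sketch: c^t = sum of all previous attention distributions,
# covloss_t = sum_i min(a_i^t, c_i^t). Toy attention vectors, not a real decoder.
src_len = 5
coverage = torch.zeros(src_len)          # c^0 = 0: nothing attended to yet

for step in range(3):                    # three decoding steps with made-up attention
    attention = torch.softmax(torch.randn(src_len), dim=-1)   # a^t (would also see `coverage`
                                                              # through the alignment model)
    cov_loss = torch.sum(torch.min(attention, coverage))      # penalize re-attending to covered words
    coverage = coverage + attention                           # c^{t+1} = c^t + a^t
    print(f"step {step}: coverage loss = {cov_loss.item():.3f}")
```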
Slide31
Results
ROUGE scores of the trained models on the CNN/Daily Mail dataset, computed on the test set.
Slide32