Show and Tell: A Neural Image Caption Generator (CVPR 2015)
Presenters: Tianlu Wang, Yin Zhang
October 5th
Human: A young girl asleep on the sofa cuddling a stuffed bear.
NIC: A baby is asleep next to a teddy bear.
Neural Image Caption (NIC)
Main Goal: automatically describe the content of an image using properly formed English sentences
Mathematically: build a single joint model that takes an image I as input and is trained to maximize the likelihood p(Sentence | Image) of producing a target sequence of words.
Inspiration from the Machine Translation task
The target sentence is generated by maximizing the likelihood P(T|S), where T is the target-language sentence and S is the source-language sentence.
Machine translation uses the Encoder-Decoder structure:
Encoder (RNN): transforms the source sentence into a rich fixed-length vector
Decoder (RNN): takes the output of the encoder as input and generates the target sentence
An example: translating words written in the source language "ABCD" to those in the target language "XYZQ".
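That structure fits in a few lines of PyTorch. A minimal sketch under assumed sizes and names (Encoder, Decoder, hidden_size are ours, not the paper's code):

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):
        # The final (h, c) state is the "rich fixed-length vector".
        _, state = self.rnn(self.embed(src))
        return state

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, state):
        # Generate the target sentence conditioned on the encoder's state.
        h, state = self.rnn(self.embed(tgt), state)
        return self.out(h), state
```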
NIC Model Architecture
NIC follows the same Encoder-Decoder structure:
Encoder (deep CNN): transforms the image into a rich fixed-length vector
Decoder (RNN): takes the output of the encoder as input and generates the target sentence
NIC Model Architecture
Choice of CNN: the winner of the ILSVRC 2014 classification competition (GoogLeNet)
Choice of RNN: an LSTM RNN (Recurrent Neural Network with LSTM cells)
In the training process, they left the CNN unchanged and trained only the RNN part.
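In PyTorch terms, that training choice amounts to freezing the encoder's parameters. A sketch, assuming cnn and decoder are modules like the ones above:

```python
import torch

# The CNN stays exactly as pretrained; its weights receive no gradients.
for p in cnn.parameters():
    p.requires_grad = False

# Only the decoder RNN's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(decoder.parameters(), lr=0.01)
```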
RNN (Recurrent Neural Network)
Why? Sequential tasks: speech, text, video...
E.g., translating a word based on the previous one.
Advantage: it passes information from one step to the next, so information persists.
How? Loops: multiple copies of the same cell (module), each passing a message to its successor.
Want to know more? http://karpathy.github.io/2015/05/21/rnn-effectiveness/
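That loop is short enough to write out directly; a toy vanilla-RNN step with made-up dimensions:

```python
import torch

# Hypothetical sizes: 64-dim inputs, 128-dim hidden state.
W_xh = torch.randn(64, 128) * 0.01
W_hh = torch.randn(128, 128) * 0.01
b_h  = torch.zeros(128)

def rnn_forward(inputs):
    h = torch.zeros(128)                 # information persists in h
    for x in inputs:                     # the same cell, applied at every step
        h = torch.tanh(x @ W_xh + h @ W_hh + b_h)
    return h                             # final state summarizes the sequence

print(rnn_forward([torch.randn(64) for _ in range(10)]).shape)  # torch.Size([128])
```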
RNN & LSTM
Why is LSTM better?
The long-term dependency problem: the translation of the last word may depend on information from the first word. As the gap between the relevant information and where it is needed grows, plain RNNs fail.
Long Short-Term Memory networks remember information for long periods of time.
LSTM (Long Short-Term Memory)
Cell state: information flows along it!
Gates: optionally let information through.
LSTM Cont. (forget gate)
Inputs: the current input x_t and the previous output h_{t-1}
f_t is a vector whose elements lie between 0 and 1
It decides what information to throw away from the cell state.
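In the notation of colah's blog (cited in the references), the forget gate is:

```latex
f_t = \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
```

The sigmoid squashes each element into (0, 1): a 1 means "keep this component of the cell state entirely", a 0 means "forget it completely".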
LSTM Cont. (input gate)
Input gate: decides what new information will be stored in the cell state
Decide which values will be updated
Create new candidate values (tanh pushes each value to be between -1 and 1)
Update the old cell state into the new cell state
LSTM Cont. (output gate)
Decide what parts of the cell state we'll output
Output the parts we decided to.
Result
Captions are scored with the BLEU metric: https://en.wikipedia.org/wiki/BLEU
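For reference (this formula comes from the BLEU definition, not from the slides): BLEU combines modified n-gram precisions p_n with a brevity penalty BP, typically with N = 4 and uniform weights w_n = 1/N:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

where c is the candidate length and r the reference length.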
References:
Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. https://arxiv.org/pdf/1411.4555v2.pdf
Talk: http://techtalks.tv/talks/show-and-tell-a-neural-image-caption-generator/61592/
Understanding LSTM Networks, colah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/