Image Caption with Deep Learning
Yulia Kogan and Ron Shiff
19.06.2016
References
J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks.
arXiv preprint arXiv:1410.1090, 2014
R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models.
arXiv preprint arXiv:1411.2539, 2014
J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description.
arXiv preprint arXiv:1411.4389, 2015
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator.
arXiv preprint arXiv:1411.4555, 2015.
A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions.
CVPR 2015 (Oral)
Structure of the talk
Problem formulation
Models: RNN + CNN
Architecture details
Evaluation problems
Results
Vector arithmetic
Dense image caption (Karpathy et al.)
I think it’s a David Bowie holding cat and he seems
Problem formulation
Useful for
Early childhood education
Foreign language education
Visually impaired people
Image retrieval and image search
Problem formulation
Hard task:
Objects (cat, dog)
Attributes (white, furry)
Relations (playing together)
Location (in a room)
Describe it in proper language
(Generated):
A square with burning street lamps and a street in the foreground
Different tasks
Image:
Image description (produce a new sentence)
Sentence retrieval (pick the best sentence)
Sentence ranking (order candidate sentences)
Image retrieval (pick the best image)
Video (Donahue et al.):
Activity recognition (short label)
Video description (produce a new sentence)
Models: RNN + CNN
How to combine image and sentence?
RNN + CNN:
Encoder-decoder model
Multimodal layer
Encoder-decoder model: machine translation
Encoder-decoder model: image caption
Encoder-decoder model: Vinyals et al.
Encoder-decoder model in time
Loss: log-likelihood of a word given the image and the context:
log P(w_t | I, w_1, ..., w_{t-1})
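The per-word log-likelihood loss above can be sketched in a few lines. This is a toy numpy illustration (not any paper's actual code); `probs` stands in for the RNN decoder's softmax outputs, one row per time step:

```python
import numpy as np

def caption_log_likelihood(probs, target_ids):
    """Sum of log P(w_t | image, w_1..w_{t-1}) over a sentence.

    probs      -- (T, V) array; row t is the softmax over the vocabulary
                  produced by the decoder at step t (conditioned on the
                  image and the previous words).
    target_ids -- length-T list of ground-truth word indices.
    """
    return float(np.sum(np.log(probs[np.arange(len(target_ids)), target_ids])))

# Toy example: vocabulary of 3 words, 2-word caption.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
ll = caption_log_likelihood(probs, [0, 1])  # log 0.7 + log 0.8
```

Training maximizes this quantity (equivalently, minimizes its negation) over all image-sentence pairs.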
Multimodal layer
Multimodal layer at time 0
Loss: log-likelihood of a word given the image and the context
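A Mao et al.-style multimodal layer can be sketched as an additive fusion of the word embedding, the recurrent state, and the CNN image feature in a common space. The matrix names and dimensions below are illustrative only (learned in the real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_r, d_i, d_m = 4, 6, 8, 5          # toy dimensions, not the paper's

# Projection matrices into the multimodal space (illustrative names).
Vw = rng.standard_normal((d_m, d_w))
Vr = rng.standard_normal((d_m, d_r))
Vi = rng.standard_normal((d_m, d_i))

def multimodal_layer(w, r, image_feat):
    """Project the word embedding w, the recurrent state r, and the CNN
    image feature into one space, fuse them additively, then apply an
    element-wise non-linearity. A softmax over words would follow."""
    return np.tanh(Vw @ w + Vr @ r + Vi @ image_feat)

m = multimodal_layer(rng.standard_normal(d_w),
                     rng.standard_normal(d_r),
                     rng.standard_normal(d_i))
```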
Architecture decisions
Model: RNN/LSTM
Type of non-linearity (sigmoid, tanh, ReLU, etc.)
Feed the image to the RNN at every step or only once
Random initialization or pretrained models
How images and texts are fed in
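One of the design choices above, feeding the image at every step versus only once, can be contrasted in a toy RNN unroll. All weight names below are illustrative, not taken from any of the papers:

```python
import numpy as np

def rnn_caption_steps(image_feat, word_embs, Wh, Wx, Wi,
                      feed_image_every_step=True):
    """Toy RNN unroll contrasting two design choices from the talk:
    re-inject the CNN image feature at every step, or use it only to
    initialise the hidden state (as in Vinyals et al.-style decoders)."""
    h = np.tanh(Wi @ image_feat)          # image initialises the state
    states = []
    for x in word_embs:
        inp = Wx @ x
        if feed_image_every_step:
            inp = inp + Wi @ image_feat   # image re-injected each step
        h = np.tanh(Wh @ h + inp)
        states.append(h)
    return states

rng = np.random.default_rng(1)
h_dim, w_dim, i_dim = 5, 3, 4
Wh = rng.standard_normal((h_dim, h_dim))
Wx = rng.standard_normal((h_dim, w_dim))
Wi = rng.standard_normal((h_dim, i_dim))
words = [rng.standard_normal(w_dim) for _ in range(4)]
img = rng.standard_normal(i_dim)
states = rnn_caption_steps(img, words, Wh, Wx, Wi, feed_image_every_step=False)
```

Vinyals et al. report that feeding the image once avoids the network overfitting to image noise; Mao et al. instead fuse the image at every step in the multimodal layer.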
Architecture details: Mao et al.
Multimodal layer (Mao et al.): sentence + image
Architecture details: Vinyals et al.
Architecture: Donahue et al.
Problems of evaluation
Summer medieval festival.
Two men are fighting with swords.
Knights are having a tournament.
Lots of people in colourful dresses on green grass.
Evaluation (sentence generation)
Human evaluation
Costly
Level of inter-human agreement is low (Vinyals et al.: 65%)
Multiple references for one image (usually 5)
Still not enough diversity
Not a lot of data
Evaluation (sentence generation)
BLEU-N score (~ precision)
BLEU-1: adequacy
BLEU-2, BLEU-3: fluency
THERE IS A CAT
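The core of the BLEU-N score is a clipped n-gram precision; a minimal sketch (real BLEU combines n = 1..N geometrically and adds a brevity penalty, and the example sentences here are invented):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Modified (clipped) n-gram precision: each candidate n-gram is
    credited at most as many times as it occurs in the reference."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hits = Counter(cand)
    clipped = sum(min(c, ref[g]) for g, c in hits.items())
    return clipped / max(len(cand), 1)

cand = "there is a cat".split()
ref = "there is a cat on the mat".split()
p1 = ngram_precision(cand, ref, n=1)              # all 4 unigrams match
p_bad = ngram_precision(["the"] * 3, ["the", "cat"], n=1)  # clipping at work
```

The second call shows why clipping is needed: without it, a degenerate caption repeating one reference word would score perfect precision, which previews the BLEU problems discussed next.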
BLEU problems
Evaluation: retrieval and ranking
Recall@K (K = 1, 5, 10): number of images for which the correct sentence is retrieved in the top K.
Medr: median rank of the first correct sentence (lower is better).
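Both retrieval metrics are straightforward to compute from the rank of the first correct sentence per image (toy ranks below are invented):

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose first correct item is in the top k.
    ranks -- 1-based rank of the first correct sentence for each image."""
    return float(np.mean(np.asarray(ranks) <= k))

def median_rank(ranks):
    """Medr: median 1-based rank of the first correct sentence."""
    return float(np.median(ranks))

ranks = [1, 3, 7, 2, 15]
r_at_5 = recall_at_k(ranks, 5)   # 3 of 5 queries ranked <= 5 -> 0.6
med = median_rank(ranks)         # 3.0
```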
Results: Vinyals et al.
Results: Kiros et al.
Results (pictures)
I think it’s a dog that is standing in the dirt.
I think it’s a David Bowie holding cat and he seems
I think it’s a cat sitting on a table.
I’m not really confident but I think it’s a close up of a cat looking at the camera.
I’m not really confident but I think it’s a close up of a two giraffes near a tree.
Vector arithmetic
king – man + woman = queen
paris – france + poland = warsaw
word2vec (http://deeplearner.fz-qqq.net/)
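The king – man + woman analogy reduces to nearest-neighbour search in embedding space. A sketch with hand-made 2-D vectors (a real system would use trained word2vec/GloVe embeddings):

```python
import numpy as np

def analogy(emb, a, b, c):
    """Solve 'a is to b as c is to ?': find the word closest (cosine)
    to emb[b] - emb[a] + emb[c], excluding the three query words."""
    q = emb[b] - emb[a] + emb[c]
    q = q / np.linalg.norm(q)
    best, best_sim = None, -1.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = float(q @ (v / np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Toy embeddings: axis 0 ~ gender, axis 1 ~ royalty.
emb = {"man":   np.array([1.0, 0.0]),
       "woman": np.array([-1.0, 0.0]),
       "king":  np.array([1.0, 1.0]),
       "queen": np.array([-1.0, 1.0]),
       "cat":   np.array([0.0, -1.0])}
result = analogy(emb, "man", "king", "woman")  # -> "queen"
```

Kiros et al. show the same arithmetic carries over to joint image-text embeddings, as in the next slides.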
Vector arithmetic (colors): Kiros et al.
Vector arithmetic (structure): Kiros et al.
Karpathy et al.: dense captions
Karpathy et al.: dense captions (ranking)
Pretrain a region CNN (RCNN) for object regions (instead of whole images)
Detect the top 19 regions (bounding boxes)
Learn a sentence-image score
Karpathy et al.: dense captions (ranking)
Learn a sentence-image score
Loss
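The loss on the sentence-image score can be sketched as a max-margin ranking objective in the spirit of Karpathy and Fei-Fei (a simplified toy version, not the paper's exact formulation; the margin value is an assumption):

```python
import numpy as np

def ranking_loss(S, margin=1.0):
    """Max-margin ranking loss over a score matrix S: S[k, l] is the
    compatibility score of image k and sentence l, with matching pairs
    on the diagonal. The loss pushes each matching score above every
    mismatched one by `margin`, in both directions (rank sentences per
    image and images per sentence)."""
    n = S.shape[0]
    diag = np.diag(S)
    cost_s = np.maximum(0.0, margin + S - diag[:, None])  # wrong sentences
    cost_i = np.maximum(0.0, margin + S - diag[None, :])  # wrong images
    cost_s[np.arange(n), np.arange(n)] = 0.0              # ignore true pairs
    cost_i[np.arange(n), np.arange(n)] = 0.0
    return float(cost_s.sum() + cost_i.sum())

well_separated = ranking_loss(np.array([[2.0, 0.0], [0.0, 2.0]]))  # 0.0
all_equal = ranking_loss(np.array([[1.0, 1.0], [1.0, 1.0]]))       # 4.0
```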
Karpathy et al. 2015: dense captions (ranking)
RCNN + BRNN:
Take-home message
Image captioning is in good shape
Sequential nature of RNN / LSTM
Encoder-decoder model / multimodal layer
Evaluation problems
Vector arithmetic