Image Caption with Deep Learning

Uploaded On 2017-03-27

Presentation Transcript

Slide1

Image Caption with Deep Learning

Yulia Kogan and Ron Shiff

19.06.2016

Slide2

References

J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.

R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2015.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2015.

A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR 2015 (Oral).

Slide3

Structure of the talk

Problem formulation

Models: RNN + CNN

Architecture details

Evaluation problems

Results

Vector arithmetic

Dense image caption (Karpathy et al.)

Slide4

I think it’s a David Bowie holding cat and he seems

Slide5

Problem formulation

Useful for

Early childhood education

Foreign language education

Visually impaired people

Image retrieval and image search

Slide6

Problem formulation

Hard task:

Objects (cat, dog)

Attributes (white, furry)

Relations (playing together)

Location (in a room)

Describe it in proper language

Slide7

(Generated):

A square with burning street lamps and a street in the foreground

Slide8

Different tasks

Image:

Image description (produce new sentence)

Sentence retrieval (pick the best sentence)

Sentence ranking (pick the best sentence)

Image retrieval (pick the best image)

Video (Donahue et al.):

Activity recognition (short label)

Video description (produce new sentence)

Slide9

Models: RNN + CNN

How to combine image and sentence?

RNN + CNN:

Encoder-decoder model

Multimodal layer

Slide10

Encoder-decoder model: machine translation

Slide11

Encoder-decoder model: image caption

Slide12

Encoder-decoder model: Vinyals et al.

Slide13

Encoder-decoder model in time

Log-likelihood of a word given the image and context:
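The formula on this slide did not survive in the transcript. As a hedged sketch, assuming the slide follows the formulation in Vinyals et al. (Show and tell), the log-likelihood of a caption factorizes over its words. The symbols are assumptions for illustration: I is the image, S = (S_0, ..., S_N) is the sentence, and S_t is the word at time step t.

```latex
\log p(S \mid I) = \sum_{t=0}^{N} \log p\left(S_t \mid I, S_0, \ldots, S_{t-1}\right)
```

Under this assumption, training maximizes the sum of these log-likelihoods over all image-sentence pairs, and the RNN/LSTM hidden state carries the context S_0, ..., S_{t-1}.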

Slide14

Multimodal layer

Slide15

Multimodal layer at time 0

Loss: log-likelihood of a word given the image and context

Slide16

Architecture decisions

Model

RNN/LSTM

Type of non-linearity (sigmoid, tanh, ReLU, etc.)

Feed image to RNN on every step/once

Random initialization/pretrained models

How images and texts are fed

Slide17

Architecture details: Mao et al.

Slide18

Multimodal layer (Mao et al.): sentence + image

Slide19

Architecture details: Vinyals et al.

Slide20

Architecture: Donahue et al.

Slide21

Problems of evaluation

Slide22

Summer medieval festival.

Two men are fighting with swords.

Knights are having a tournament.

Lots of people in colourful dresses on green grass.

Slide23

Evaluation (sentence generation)

Human evaluation

Costly

Level of inter-human agreement is low (Vinyals et al.: 65%)

Multiple references for one image (usually 5)

Still not enough diversity

Not a lot of data

Slide24

Evaluation (sentence generation)

BLEU-N score (~ precision)

BLEU-1: adequacy

BLEU-2, BLEU-3: fluency
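The "~ precision" reading of BLEU-N can be sketched as a clipped (modified) n-gram precision against several reference captions. This is a minimal illustration only: it omits the brevity penalty and the geometric mean over n that full BLEU uses, and the function names and example sentences are assumptions made up for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # Each n-gram is credited at most as often as it occurs in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Hypothetical reference captions and candidates, for illustration only.
references = [
    "there is a cat on the table".split(),
    "a cat is sitting on a table".split(),
]
short_caption = "there is a cat".split()
repeated_word = "the the the the".split()

p1 = modified_precision(short_caption, references, 1)
p2 = modified_precision(short_caption, references, 2)
p_repeat = modified_precision(repeated_word, references, 1)
```

In this toy setup a very short, generic caption still reaches perfect 1-gram and 2-gram precision, which is one reason a precision-only score needs a length penalty; clipping is what keeps the repeated-word candidate from scoring well.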

THERE IS A CAT

Slide25

BLEU problems

Slide26

Evaluation: Retrieval and Ranking

Recall@K (K = 1, 5, 10): number of images for which the correct sentence is retrieved in the top K.

Medr: median rank of the first correct sentence (low is good).

Slide27
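The retrieval and ranking metrics from the evaluation slide (Recall@K and Medr) can be sketched as follows. The input is assumed to be, for each query image, the 1-based rank of the first correct sentence in the model's ranking; the rank values are hypothetical, and Recall@K is returned as a fraction rather than a raw count.

```python
from statistics import median


def recall_at_k(ranks, k):
    """Fraction of queries whose first correct item is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median rank of the first correct item over all queries (lower is better)."""
    return median(ranks)


# Hypothetical ranks of the first correct sentence for six query images.
ranks = [1, 3, 7, 12, 2, 40]

r_at_1 = recall_at_k(ranks, 1)
r_at_5 = recall_at_k(ranks, 5)
r_at_10 = recall_at_k(ranks, 10)
med_r = median_rank(ranks)
```

The same two functions apply to image retrieval by swapping roles: the query is a sentence and the rank is that of the first correct image.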

Results: Vinyals et al.

Slide28

Results: Kiros et al.

Slide29

Results (pictures)

I think it’s a dog that is standing in the dirt.

Slide30

I think it’s a David Bowie holding cat and he seems

Slide31

I think it’s a cat sitting on a table.

Slide32

I’m not really confident but I think it’s a close up of a cat looking at the camera.

Slide33

I’m not really confident but I think it’s a close up of a two giraffes near a tree.

Slide34

Vector arithmetic

king – man + woman = queen

paris – france + poland = warsaw
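The analogies above can be sketched as a nearest-neighbour search around the vector a - b + c. The two-dimensional embeddings below are hand-made toy values chosen so that the arithmetic works out; they are not word2vec vectors, and the function names are assumptions for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(emb, a, b, c):
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Toy embeddings (hypothetical): first axis ~ "royalty", second axis ~ "gender".
toy_embeddings = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

result = analogy(toy_embeddings, "king", "man", "woman")
```

Excluding the three input words from the candidates is a common convention, since the nearest vector to a - b + c is otherwise often one of the inputs.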

word2vec (http://deeplearner.fz-qqq.net/)

Slide35

Vector arithmetic (colors): Kiros et al.

Slide36

Vector arithmetic (Structure): Kiros et al.

Slide37

Karpathy et al.: dense captions

Slide38

Karpathy et al.: dense captions (ranking)

Pretrain Region CNN for object regions (instead of images)

Detect top 19 regions (bounding boxes)

Learn sentence-image score

Slide39

Karpathy et al.: dense captions (ranking)

Learn sentence-image score
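The score formula is not preserved in this transcript. As a hedged sketch, assuming the simplified alignment score described by Karpathy and Fei-Fei, each word vector is matched to its best-scoring region vector by dot product and the results are summed over the words. The vectors below are hypothetical toy values and the function names are made up for this illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sentence_image_score(region_vecs, word_vecs):
    """Sum over words of the best dot product with any image region."""
    return sum(max(dot(v, s) for v in region_vecs) for s in word_vecs)


# Toy embeddings in a shared 2-d space (hypothetical values).
matching_regions = [[1.0, 0.0], [0.0, 1.0]]
other_regions = [[1.0, 0.0]]
words = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

score_match = sentence_image_score(matching_regions, words)
score_other = sentence_image_score(other_regions, words)
```

A ranking objective then pushes the score of a true image-sentence pair above the scores of mismatched pairs by a margin.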

Loss

Slide40

Karpathy et al. 2015: dense captions ranking

RCNN + BRNN:

Slide41

Take-home message

Image caption is in good shape

Sequential nature of RNN / LSTM

Encoder-decoder model / multimodal layer

Evaluation problems

Vector arithmetic