
Show and Tell: A Neural Image Caption Generator (CVPR 2015)

Presenters: Tianlu Wang, Yin Zhang

October 5th

Human: A young girl asleep on the sofa cuddling a stuffed bear.

NIC: A baby is asleep next to a teddy bear.

Neural Image Caption (NIC)

Main goal: automatically describe the content of an image using properly formed English sentences.

Mathematically: build a single joint model that takes an image I as input and is trained to maximize the likelihood p(Sentence | Image) of producing a target sequence of words.
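Written out in the paper's notation, with θ the model parameters, I the image, and S = (S_0, …, S_N) the caption, the objective and its chain-rule expansion are:

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}; \theta)
```

Each word is predicted conditioned on the image and on all previously generated words, which is what makes an RNN a natural choice for the decoder.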

Inspiration from the Machine Translation Task

The target sentence is generated by maximizing the likelihood P(T|S), where T is the target sentence and S is the source sentence.

Machine translation uses the Encoder-Decoder structure:

Encoder (RNN): transforms the source sentence into a rich fixed-length vector

Decoder (RNN): takes the encoder's output as input and generates the target sentence

An example: translating the words "ABCD" written in the source language into "XYZQ" in the target language.

NIC Model Architecture

Follows the Encoder-Decoder structure:

Encoder (deep CNN): transforms the image into a rich fixed-length vector

Decoder (RNN): takes the encoder's output as input and generates the target sentence
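A minimal sketch of this pipeline at inference time. The CNN encoder and LSTM decoder below are toy stand-ins (made-up functions, not the trained networks from the paper), kept only to show the control flow: encode the image once, then decode one word at a time until an end token appears.

```python
# Sketch of NIC-style greedy decoding; the encoder/decoder are stubs.
START, END = "<s>", "</s>"

def cnn_encode(image):
    # Stand-in for the deep CNN: map the image to a fixed-length vector.
    return [float(sum(image)) / len(image)]

def lstm_step(state, token):
    # Stand-in for one LSTM step: returns (new_state, next_token).
    # A real decoder would return a probability distribution over the vocabulary.
    script = {START: "a", "a": "teddy", "teddy": "bear", "bear": END}
    return state, script.get(token, END)

def generate_caption(image, max_len=10):
    state = cnn_encode(image)     # the image vector initializes the decoder
    token, words = START, []
    for _ in range(max_len):      # greedy decoding, one word per step
        state, token = lstm_step(state, token)
        if token == END:
            break
        words.append(token)
    return " ".join(words)
```

With the stub "vocabulary" above, `generate_caption([0.1, 0.2, 0.3])` returns "a teddy bear"; in the real model the next word is chosen from the LSTM's softmax output (or with beam search) rather than from a fixed table.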

NIC Model Architecture

Choice of CNN: the winner of the ILSVRC 2014 classification competition

Choice of RNN: an LSTM network (a Recurrent Neural Network with LSTM cells)

During training, they left the CNN unchanged and trained only the RNN part.

RNN (Recurrent Neural Network)

Why? Sequential tasks: speech, text, video…

E.g., translating a word based on the previous one.

Advantage: information is passed from one step to the next, so information persists.

How? Loops: multiple copies of the same cell (module), each passing a message to its successor.

Want to know more? http://karpathy.github.io/2015/05/21/rnn-effectiveness/
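That loop can be sketched with a single scalar weight per connection (the weights here are made up; real RNNs use learned matrices). The same cell is applied at every step, and the hidden state h is the "message" carried forward:

```python
import math

def rnn_step(x_t, h_prev, w=0.5, u=0.8, b=0.0):
    # One RNN step: combine the current input with the previous hidden state.
    return math.tanh(w * x_t + u * h_prev + b)

def run_rnn(inputs):
    h = 0.0                  # initial hidden state
    history = []
    for x in inputs:         # the "loop": same cell at every time step
        h = rnn_step(x, h)   # h carries information forward
        history.append(h)
    return history

states = run_rnn([1.0, 0.0, 0.0])
# Even though the later inputs are zero, the later states are nonzero:
# information from the first input persists through the recurrence.
```

This persistence is exactly what degrades over long gaps in a plain RNN, which motivates the LSTM cell on the next slides.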

RNN & LSTM

Why is LSTM better?

The long-term dependency problem: the translation of the last word can depend on information from the first word, and when the gap between the relevant information and the place it is needed grows, plain RNNs fail.

Long Short-Term Memory networks remember information for long periods of time.

LSTM (Long Short-Term Memory)

Cell state: information flows along it.

Gates: optionally let information through.

LSTM Cont. (forget gate)

Inputs: the current input x and the previous output h.

Output f: a vector whose elements lie between 0 and 1 (0 means completely forget, 1 means completely keep).

The forget gate decides what information to throw away from the cell state.

LSTM Cont. (input gate)

Input gate: decides what new information will be stored in the cell state.

Decide which values will be updated.

Create new candidate values (a tanh layer pushes each value to be between -1 and 1).

Update the old cell state into the new cell state.

LSTM Cont. (output gate)

Decide what parts of the cell state we'll output.

Output the parts we decided to.
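Putting the forget, input, and output gates together, one LSTM step can be sketched in plain Python (scalar state and made-up illustrative weights; a real cell uses learned weight matrices and vector-valued states):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    # W maps each gate name to an illustrative (weight_x, weight_h, bias) triple.
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate: 0..1
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # input gate: 0..1
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate: -1..1
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate: 0..1
    c = f * c_prev + i * g    # new cell state: keep part of the old, add part of the new
    h = o * math.tanh(c)      # output only the parts of the cell state we decided to
    return h, c

W = {"f": (0.5, 0.1, 0.0), "i": (0.6, 0.2, 0.0),
     "g": (1.0, 0.3, 0.0), "o": (0.7, 0.1, 0.0)}
h, c = lstm_step(1.0, 0.0, 0.0, W)
```

The key line is `c = f * c_prev + i * g`: because the cell state is updated additively rather than being squashed through a nonlinearity at every step, information can flow along it for many steps.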

Result

BLEU: https://en.wikipedia.org/wiki/BLEU
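BLEU scores a candidate caption by its n-gram overlap with reference captions. At its core is a clipped ("modified") n-gram precision; a minimal unigram sketch is below (full BLEU combines precisions for n = 1..4 with a brevity penalty), applied to the captions from the opening slide:

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    # Count each candidate word, but clip its count by how often the word
    # appears in the reference, so repeating a word cannot inflate the score.
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / sum(cand.values())

score = modified_unigram_precision(
    "a baby is asleep next to a teddy bear",                      # NIC caption
    "a young girl asleep on the sofa cuddling a stuffed bear",    # human caption
)
# Matches: "a" (twice), "asleep", "bear" -> 4 clipped matches out of 9 words.
```

Here `score` is 4/9: the clipping is why "a a a" against reference "a b" scores only 1/3, not 1.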

References:

Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. https://arxiv.org/pdf/1411.4555v2.pdf

Talk: http://techtalks.tv/talks/show-and-tell-a-neural-image-caption-generator/61592/

Understanding LSTM Networks, colah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/