Presentation Transcript

Noriko Tomuro

1

CSC 578 Neural Networks and Deep Learning

7. Recurrent Neural Networks

(Some figures adapted from the Deep Learning book by Goodfellow et al.)

Sequence Data and Sequence Models

Noriko Tomuro

2

Sequence Data is any kind of data where the order matters. In sequence data, the input is a sequence rather than independent values: the input unit at position/time t is dependent on the input (sub)sequence before the unit (e.g. 0 through t-1).

Some typical sequence data are:
Time-series data – time is the order, e.g. stock prices in the last 3 months.
DNA sequences – the position of nucleotide bases (As, Ts, Cs, and Gs).
Text documents – the position of words in a sentence matters.
Signal processing (e.g. speech and music) – time is the order.

One of the tasks applied to sequence data, Sequence Prediction, is to predict the value at position t from the previous subsequence. So, data for sequence prediction has:
Input == values in the previous subsequence up to t-1
Output == value at position t
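As an illustration, a time series can be turned into such (input, output) pairs with a simple sliding window. The helper below is a hypothetical sketch (the function name and values are not from the slides):

import numpy as np

def make_windows(series, lookback):
    """Split a 1-D sequence into (input, output) pairs: each input is the
    previous `lookback` values, and the output is the value that follows."""
    X, y = [], []
    for t in range(lookback, len(series)):
        X.append(series[t - lookback:t])   # values up to t-1
        y.append(series[t])                # value at position t
    return np.array(X), np.array(y)

# e.g. [1, 2, 3, 4, 5] with lookback=3 gives X = [[1,2,3], [2,3,4]], y = [4, 5]
X, y = make_windows([1, 2, 3, 4, 5], lookback=3)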

Also the correspondence between input and output may not be one-to-one. Here are some patterns:

Noriko Tomuro

3

[Figure: input/output correspondence patterns over an example sequence a b c d e f …]

Noriko Tomuro

4

Then, in sequence models, the values in the previous subsequence must be kept in the model, since the values are dependent. So the model must internally have facilities for memory and mechanisms to propagate that memory through the sequence.

1 Recurrent Neural Networks

Machine Learning, Tom Mitchell

5

Recurrent Neural Networks (RNNs) use the outputs of network units at time t as the input to other units at time t+1. Because of this topology, RNNs are often used for sequential modeling, such as with time-series data. The information brought from time t to t+1 is essentially the context of the preceding input, and serves as the network’s internal memory.

Noriko Tomuro

6

A basic RNN is essentially equivalent to a feed-forward network, since the recurrence can be unfolded (in time).

Noriko Tomuro

7

The information from time t-1 could come from the hidden node(s) or from the output, depending on the architecture: an Elman network uses the hidden state, while a Jordan network (less powerful) uses the output. Accordingly, the activation function will be different.

Note: b and c are bias vectors. Also, the neuron activation function could be something other than tanh, such as ReLU.
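The slide's equations appear only as images in the transcript; as a hedged reconstruction, the standard Elman-style update with hidden-to-hidden recurrence is h_t = tanh(b + W h_{t-1} + U x_t) with output o_t = c + V h_t, while a Jordan-style network feeds the previous output back in place of h_{t-1}. A minimal NumPy sketch of one Elman step (names are illustrative):

import numpy as np

def elman_step(x_t, h_prev, U, W, V, b, c):
    """One time step of an Elman-style RNN."""
    h_t = np.tanh(b + W @ h_prev + U @ x_t)  # recurrent hidden update
    o_t = c + V @ h_t                        # output (pre-activation)
    return h_t, o_t

# Unfolding in time is just a loop over the sequence:
#   h = h0
#   for x_t in sequence:
#       h, o = elman_step(x_t, h, U, W, V, b, c)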

 


2 RNN Training

http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/

8

Each sequence produces an error as the sum of the deviations of all target signals from the corresponding activations computed by the network.

To measure the error at each time t, most of the loss functions used in feed-forward neural networks can be used:
Negative log likelihood (also often called cross-entropy)
Mean squared error (MSE)
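The loss formulas were images on the slide. With a one-hot target y_t and a predicted distribution ŷ_t, the cross-entropy at time t is E_t = -Σ_i y_{t,i} log ŷ_{t,i}, and the loss for a sequence is E = Σ_t E_t. A small NumPy sketch (illustrative, not from the slides):

import numpy as np

def sequence_cross_entropy(y_true, y_pred):
    """Sum over time steps of the cross-entropy between one-hot targets and
    predicted probabilities; y_true, y_pred have shape (T, num_classes)."""
    eps = 1e-12                                                 # avoid log(0)
    per_step = -np.sum(y_true * np.log(y_pred + eps), axis=1)   # E_t for each t
    return per_step.sum()                                       # E = sum_t E_t

def sequence_mse(y_true, y_pred):
    """Mean squared error per step, summed over the sequence."""
    return np.sum(np.mean((y_true - y_pred) ** 2, axis=-1))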

“Deep Learning” by Goodfellow et al.

9

There is also another network setup for training, called ‘teacher forcing’: rather than the values computed at the hidden or output nodes, the information passed from the previous time step uses the correct, target output (from the training data).

2.1 Loss Minimization

Machine Learning, Tom Mitchell

10

For loss minimization, common approaches are:

Gradient Descent
The standard method is BackPropagation Through Time (BPTT), which is a generalization of the BP algorithm for feed-forward networks. Basically, the error computed at the end of the (input) sequence, which is the sum of all errors in the sequence, is propagated backward through the ENTIRE sequence, e.g. for t=3 (equations reconstructed below):
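A hedged reconstruction of the missing equations, in the notation of the cited WildML tutorial (hidden state s_t, prediction ŷ_t):

\frac{\partial E_3}{\partial W}
  = \sum_{k=0}^{3}
    \frac{\partial E_3}{\partial \hat{y}_3}\,
    \frac{\partial \hat{y}_3}{\partial s_3}\,
    \frac{\partial s_3}{\partial s_k}\,
    \frac{\partial s_k}{\partial W},
\qquad \text{where} \qquad
\frac{\partial s_3}{\partial s_k}
  = \prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}}.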

Since the sequence could be long, oftentimes we clip the backward propagation by truncating the backpropagation to a few steps.

Machine Learning, Tom Mitchell

11

“Deep Learning” by Goodfellow et al.

12

Then the gradients on the various parameters become (a hedged reconstruction follows):
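In the Deep Learning book's notation (h_t = tanh(b + W h_{t-1} + U x_t), o_t = c + V h_t, total loss L), the gradients have roughly the following form; this is reconstructed from the book, not from the slide image:

\nabla_c L = \sum_t \nabla_{o_t} L
\nabla_b L = \sum_t \mathrm{diag}\!\left(1 - h_t^{2}\right)\, \nabla_{h_t} L
\nabla_V L = \sum_t \left(\nabla_{o_t} L\right) h_t^{\top}
\nabla_W L = \sum_t \mathrm{diag}\!\left(1 - h_t^{2}\right)\, \left(\nabla_{h_t} L\right) h_{t-1}^{\top}
\nabla_U L = \sum_t \mathrm{diag}\!\left(1 - h_t^{2}\right)\, \left(\nabla_{h_t} L\right) x_t^{\top}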

However, gradient descent suffers from the same vanishing gradient problem as feed-forward networks.

http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/

13

Machine Learning, Tom Mitchell

14

Global optimization methods

“Training the weights in a neural network can be modeled as a non-linear global optimization problem. Arbitrary global optimization techniques may then be used to minimize this target function. The most common global optimization method for training RNNs is genetic algorithms.” [Wikipedia]

Noriko Tomuro

15

3 Bidirectional RNNs

Bidirectional RNNs (BRNNs) combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence.

By using two time directions, input information from both the past and the future of the current time frame can be used, unlike a standard RNN, which requires delays in order to include future information.

BRNNs can be trained using similar algorithms to RNNs, because the neurons in the two directions do not have any interactions.
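In Keras (the library used in the code example later in the slides), a bidirectional layer simply wraps a recurrent layer. A minimal hedged sketch (sizes are illustrative):

from tensorflow import keras
from tensorflow.keras import layers

# The Bidirectional wrapper runs one LSTM forward and one backward over the
# sequence and (by default) concatenates their outputs.
model = keras.Sequential([
    keras.Input(shape=(None, 8)),             # (time steps, features); length may vary
    layers.Bidirectional(layers.LSTM(16, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(16)),    # second layer keeps only the last output
    layers.Dense(1),
])
model.summary()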

Noriko Tomuro

16

4 Encoder-Decoder NNs

Generally speaking, Encoder-Decoder networks learn the mapping from an input sequence to an output sequence.

With multilayer feed-forward networks, such networks are called ‘auto-associators’: the input and output could be the same (to learn the identity function -> compression) or different (e.g., classification with a one-hot-vector output representation).

Noriko Tomuro

17

With recurrent networks, an encoder-decoder architecture acts on a sequence as the input/output unit, NOT a single unit/neuron. There is an RNN for encoding, and another RNN for decoding.

A hidden state connected from the end of the input sequence essentially represents the context (variable C), or a semantic summary, of the input sequence, and it is connected to the decoder RNN.
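A minimal Keras sketch of this idea, assuming token sequences and teacher-forced decoder inputs (the sizes and variable names are assumptions, not from the slides):

from tensorflow import keras
from tensorflow.keras import layers

num_enc_tokens, num_dec_tokens, latent_dim = 1000, 1000, 256   # assumed sizes

# Encoder RNN: only its final states are kept, serving as the context C.
enc_inputs = keras.Input(shape=(None,))
enc_emb = layers.Embedding(num_enc_tokens, 64)(enc_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder RNN: initialized with the encoder's final states (the context) and
# trained with teacher forcing (decoder input = target sequence shifted by one).
dec_inputs = keras.Input(shape=(None,))
dec_emb = layers.Embedding(num_dec_tokens, 64)(dec_inputs)
dec_seq = layers.LSTM(latent_dim, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_out = layers.Dense(num_dec_tokens, activation="softmax")(dec_seq)

model = keras.Model([enc_inputs, dec_inputs], dec_out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")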

Noriko Tomuro

18

5 Deep RNNs

RNNs can be made into deep networks in many ways. For example, recurrent layers can be stacked, as in the sketch below.
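One common construction (a hedged sketch, not the slide's figure) is to stack recurrent layers, with each layer passing its full output sequence to the next:

from tensorflow import keras
from tensorflow.keras import layers

# A deep (stacked) RNN: lower layers return the full output sequence so that
# the next recurrent layer receives one vector per time step.
deep_rnn = keras.Sequential([
    keras.Input(shape=(None, 32)),           # (time steps, features)
    layers.LSTM(64, return_sequences=True),  # recurrent layer 1
    layers.LSTM(64, return_sequences=True),  # recurrent layer 2
    layers.LSTM(64),                         # top recurrent layer: last output only
    layers.Dense(10, activation="softmax"),
])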

Noriko Tomuro

19

6 Recursive NNs

A recursive neural network is a kind of deep neural network created by applying the same set of weights recursively over a structured input.

“In the most simple architecture, nodes are combined into parents using a weight matrix that is shared across the whole network, and a non-linearity such as tanh.” [Wikipedia]

Noriko Tomuro

20

If c1 and c2 are n-dimensional vector representations of nodes, their parent will also be an n-dimensional vector, calculated as p = tanh(W [c1; c2]), where W is an n x 2n matrix.
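A tiny NumPy illustration of this composition (values are made up):

import numpy as np

n = 4
rng = np.random.default_rng(0)

W = rng.standard_normal((n, 2 * n))    # shared n x 2n weight matrix
c1 = rng.standard_normal(n)            # child vector 1
c2 = rng.standard_normal(n)            # child vector 2

parent = np.tanh(W @ np.concatenate([c1, c2]))   # n-dimensional parent vector
assert parent.shape == (n,)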

Training: Typically, stochastic gradient descent (SGD) is used to train the network. The gradient is computed using backpropagation through structure (BPTS), a variant of backpropagation through time used for recurrent neural networks. [Wikipedia]

Noriko Tomuro

21

7 Long Short-Term Memory (LSTM)

The idea of RNNs is to incorporate dependencies -- information from earlier in the input sequence -- as the context or memory used in processing the current input.

Long Short-Term Memory (LSTM) networks are a special kind of RNN, capable of learning long-term/long-distance dependencies.

An LSTM network consists of LSTM units. A common LSTM unit is composed of a context/state cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell. [Wikipedia]

http://localhost:8888/notebooks/Temp-Heaton/t81_558_class10_lstm.ipynb

22

The Big Picture: An LSTM maintains an internal state and produces an output. The following diagram shows an LSTM unit over three time slices: the current time slice (t), as well as the previous (t-1) and next (t+1) slices.

C is the context value. Both the output and context values are always fed to the next time slice.

The sigmoid layers output numbers between 0 and 1, which determine how much of each component should be let through. The pink X gate is point-wise multiplication.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

24

Step-by-step walkthrough:

(1) The forget gate controls the information coming from h_{t-1} for the new input x_t (resulting in a value between 0 and 1 by the sigmoid), for all internal state cells (the i's) at time t.

(2) The input gate applies a sigmoid to control which values (the i's) to keep at this time t. Then tanh is applied to create the draft of the new context, C~_t.

Noriko Tomuro

25

(3) The old state C_{t-1} is multiplied by f_t, to forget the things in the previous context that we decided to forget earlier. Then we add i_t * C~_t: these are the new candidate values, scaled by how much we decided to update each state value.

(4) We also decide which information to output (by filtering through the output gate). The hidden value for this time t is set by the output value multiplied by the tanh of the new context.
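Putting the four steps together, one LSTM time step in NumPy (a hedged sketch following the notation of the cited colah.github.io post; the weight names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step, following the step-by-step walkthrough above."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # (1) forget gate
    i_t = sigmoid(W_i @ z + b_i)          # (2) input gate
    C_tilde = np.tanh(W_c @ z + b_c)      # (2) draft of the new context
    C_t = f_t * C_prev + i_t * C_tilde    # (3) new context/state
    o_t = sigmoid(W_o @ z + b_o)          # (4) output gate
    h_t = o_t * np.tanh(C_t)              # (4) new hidden value
    return h_t, C_t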


RNN vs LSTM

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

27

Gated Recurrent Unit (GRU)

GRUs also take x_t and h_{t-1} as inputs. They perform some calculations and then pass along h_t. What makes them different from LSTMs is that GRUs don't need the cell state to pass values along. The calculations within each iteration ensure that the h_t values being passed along either retain a high amount of old information or are jump-started with a high amount of new information.
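For reference, the standard GRU update, in the notation of the cited colah.github.io post (a hedged addition; these equations are not in the transcript):

z_t = \sigma\!\left(W_z \cdot [h_{t-1}, x_t]\right)
r_t = \sigma\!\left(W_r \cdot [h_{t-1}, x_t]\right)
\tilde{h}_t = \tanh\!\left(W \cdot [r_t * h_{t-1}, x_t]\right)
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t

Here z_t is the update gate and r_t is the reset gate; there is no separate cell state C_t.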

https://www.tensorflow.org/guide/keras/rnn

29

8 LSTM Code Example

Here is a simple example of a Sequential model that processes sequences of integers, embeds each integer into a 64-dimensional vector, then processes the sequence of vectors into 10 target categories.
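The code itself appeared as an image on the slide; the corresponding example from the cited TensorFlow Keras RNN guide is approximately:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
# Embed each integer (vocabulary of size 1000) into a 64-dimensional vector.
model.add(layers.Embedding(input_dim=1000, output_dim=64))
# LSTM layer with 128 internal units.
model.add(layers.LSTM(128))
# Output layer with 10 units, one per target category.
model.add(layers.Dense(10))
model.summary()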

https://www.tensorflow.org/guide/keras/rnn

30

The number of parameters in the lstm layer is 98,816 because:

Each internal gate has 128 units, so each gate is a vector of length 128.
To compute each internal gate, (1) the values from the previous step (h_{t-1}, 128 values) are concatenated with the input values (x_t, 64 values), giving 192 values, which are connected to (2) all of the units in the respective gate (128 values), plus (3) a bias (128 values): (192 * 128) + 128 = 24,704.
And there are four gates, so the total number of parameters is 24,704 * 4 = 98,816.
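A quick check of this arithmetic in Python:

per_gate = (128 + 64) * 128 + 128   # weights from [h_{t-1}, x_t] plus biases = 24,704
total = 4 * per_gate                # four gate computations
print(per_gate, total)              # 24704 98816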

Noriko Tomuro

31

The data has 3 input variables (where ‘lookback’ time steps = 3) and 1 output variable, and 4 (hidden) LSTM units are chosen (for each time step/slice).

Note that the task here is regression – the activation function of the output layer (just one node) is linear by default.

Another example: https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
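A hedged Keras sketch consistent with this description (the input shape and compile settings are assumptions, since the slide's code is not in the transcript):

from tensorflow import keras
from tensorflow.keras import layers

lookback = 3   # number of previous time steps used as input

model = keras.Sequential([
    keras.Input(shape=(lookback, 1)),   # 3 time steps, 1 feature per step (assumed layout)
    layers.LSTM(4),                     # 4 hidden LSTM units
    layers.Dense(1),                    # one output node; linear activation by default
])
model.compile(optimizer="adam", loss="mean_squared_error")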