CSC 578 Neural Networks and Deep Learning
Fall 2018/19
Noriko Tomuro

7. Recurrent Neural Networks
(Some figures adapted from the Deep Learning book by Goodfellow et al.)
Sequence Data and Sequence Models

Sequence data is any kind of data where the order matters. In sequence data, the input is a sequence rather than a set of independent values -- the input unit at position/time t depends on the (sub)sequence before it (e.g., positions 0 through t-1). Some typical kinds of sequence data are:
- Time-series data -- time is the order, e.g., stock prices over the last 3 months.
- DNA sequences -- the position of the nucleotide bases (As, Ts, Cs, and Gs).
- Text documents -- the position of words in a sentence matters.
- Signal processing (e.g., speech and music) -- time is the order.
One of the tasks applied to sequence data, sequence prediction, is to predict the value at position t from the preceding subsequence. Data for sequence prediction therefore has:
- Input: the values in the preceding subsequence, up to position t-1
- Output: the value at position t
Also, the correspondence between input and output may not be one-to-one. [Figure: input/output patterns over an example sequence a b c d e f ...]
In sequence models, the values in the preceding subsequence must be kept in the model, since later values depend on them. So the model must internally have facilities for memory, and mechanisms to propagate that memory through the sequence.
1 Recurrent Neural Networks

(Machine Learning, Tom Mitchell)

Recurrent Neural Networks (RNNs) use the outputs of network units at time t as inputs to other units at time t+1. Because of this topology, RNNs are often used for sequential modeling, such as time-series data. The information carried from time t to t+1 is essentially the context of the preceding input, and serves as the network's internal memory.
A basic RNN is essentially equivalent to a feed-forward network, since the recurrence can be unfolded (unrolled) in time.
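This unfolding can be sketched as an explicit loop over time. The following is a minimal numpy sketch of a simple RNN forward pass; the weight names (U, W, V) and dimensions are illustrative, not from the slides:

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c):
    """Forward pass of a simple RNN, unfolded over time.

    xs: list of input vectors x_1..x_T
    h_t = tanh(U x_t + W h_{t-1} + b);  y_t = V h_t + c
    """
    h = np.zeros(W.shape[0])            # initial hidden state h_0
    ys = []
    for x in xs:                        # each iteration is one "unrolled" layer
        h = np.tanh(U @ x + W @ h + b)  # hidden state carries the memory
        ys.append(V @ h + c)            # output at this time step
    return ys, h

rng = np.random.default_rng(0)
U, W = rng.normal(size=(5, 3)), rng.normal(size=(5, 5))
V = rng.normal(size=(2, 5))
b, c = np.zeros(5), np.zeros(2)
xs = [rng.normal(size=3) for _ in range(4)]
ys, h_last = rnn_forward(xs, U, W, V, b, c)
```

Because the same U, W, V are applied at every step, unrolling the loop gives a feed-forward network with tied weights.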
Information from time t-1 can come from the hidden node(s) or from the output, depending on the architecture; the activation computation differs accordingly.

Elman network (recurrence from the hidden state)    Jordan network (recurrence from the output; less powerful)

Note: b and c are bias vectors. Also, the neuron activation function could be something other than tanh, such as ReLU.
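The difference between the two architectures can be sketched as follows (illustrative numpy, with assumed weight names; the only change is which previous value is fed back):

```python
import numpy as np

def elman_step(x, h_prev, U, W, V, b, c):
    # Elman: the recurrent input is the previous HIDDEN state h_{t-1}
    h = np.tanh(U @ x + W @ h_prev + b)
    y = V @ h + c
    return h, y

def jordan_step(x, y_prev, U, W, V, b, c):
    # Jordan: the recurrent input is the previous OUTPUT y_{t-1}
    # (so W here maps the output dimension back into the hidden layer)
    h = np.tanh(U @ x + W @ y_prev + b)
    y = V @ h + c
    return h, y

rng = np.random.default_rng(1)
U = rng.normal(size=(5, 3))
W_h = rng.normal(size=(5, 5))   # Elman recurrence: hidden -> hidden
W_y = rng.normal(size=(5, 2))   # Jordan recurrence: output -> hidden
V = rng.normal(size=(2, 5))
b, c = np.zeros(5), np.zeros(2)
x = rng.normal(size=3)
h1, y1 = elman_step(x, np.zeros(5), U, W_h, V, b, c)
h2, y2 = jordan_step(x, np.zeros(2), U, W_y, V, b, c)
```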
2 RNN Training

(http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/)

Each sequence produces an error as the sum of the deviations of all target signals from the corresponding activations computed by the network. To measure the error at each time t, most of the loss functions used in feed-forward neural networks can be used:
- Negative log-likelihood (also often called cross-entropy)
- Mean squared error (MSE)
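The two sequence losses above can be sketched as sums of per-step losses (illustrative function names, not from the slides):

```python
import numpy as np

def sequence_nll(prob_seq, target_seq):
    """Negative log-likelihood (cross-entropy) summed over the sequence.
    prob_seq[t] is the predicted distribution at time t;
    target_seq[t] is the index of the true class at time t."""
    return -sum(np.log(p[t]) for p, t in zip(prob_seq, target_seq))

def sequence_mse(pred_seq, target_seq):
    """Mean squared error summed over the sequence."""
    return sum(np.mean((p - t) ** 2) for p, t in zip(pred_seq, target_seq))

probs = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]
targets = [0, 1]
nll = sequence_nll(probs, targets)   # -log(0.7) - log(0.6)
```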
("Deep Learning" by Goodfellow et al.)

There is also another network setup for training, called teacher forcing: rather than using the values computed at the hidden or output nodes, the information fed forward from the previous time step uses the correct, target output (from the training data).
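Teacher forcing can be sketched with a Jordan-style recurrence, where the previous output is fed back; with teacher forcing the ground-truth target replaces the model's own (possibly wrong) prediction. The function and weight names are illustrative:

```python
import numpy as np

def jordan_decode(xs, targets, U, W, V, b, c, teacher_forcing=True):
    """Jordan-style recurrence where the previous OUTPUT is fed back.
    With teacher_forcing=True, the ground-truth target y*_{t-1} is fed
    back at each step instead of the model's own prediction."""
    y_prev = np.zeros(V.shape[0])
    ys = []
    for t, x in enumerate(xs):
        h = np.tanh(U @ x + W @ y_prev + b)
        y = V @ h + c
        ys.append(y)
        # choose what flows to the next step: truth vs. own output
        y_prev = targets[t] if teacher_forcing else y
    return ys

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))
W = rng.normal(size=(5, 2))
V = rng.normal(size=(2, 5))
b, c = np.zeros(5), np.zeros(2)
xs = [rng.normal(size=3) for _ in range(4)]
targets = [rng.normal(size=2) for _ in range(4)]
ys_tf = jordan_decode(xs, targets, U, W, V, b, c, teacher_forcing=True)
ys_free = jordan_decode(xs, targets, U, W, V, b, c, teacher_forcing=False)
```

At test time no targets are available, so the model must run in the free-running mode.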
2.1 Loss Minimization

For loss minimization, common approaches are:

Gradient descent: the standard method is Backpropagation Through Time (BPTT), which is a generalization of the BP algorithm for feed-forward networks. Basically, the error computed at the end of the (input) sequence, which is the sum of all errors in the sequence, is propagated backward through the ENTIRE sequence. For example, for t = 3:

  ∂E3/∂W = Σ_{k=0..3} (∂E3/∂ŷ3) (∂ŷ3/∂s3) (∂s3/∂s_k) (∂s_k/∂W)

where ∂s3/∂s_k = Π_{j=k+1..3} ∂s_j/∂s_{j-1}.

Since the sequence could be long, oftentimes we clip the backward propagation by truncating the backpropagation to a few steps.
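Truncated BPTT can be sketched for a plain RNN as follows. This is an illustrative numpy sketch (assumed weight names; for brevity, the loss is taken only at the final step rather than summed over the sequence):

```python
import numpy as np

def truncated_bptt_grads(xs, y, W, U, k):
    """Forward pass of h_t = tanh(W h_{t-1} + U x_t), with loss
    L = 0.5 * ||h_T - y||^2 at the final step, then a backward pass
    through at most the last k time steps (truncated BPTT)."""
    hs = [np.zeros(W.shape[0])]             # h_0
    for x in xs:
        hs.append(np.tanh(W @ hs[-1] + U @ x))
    T = len(xs)
    dW, dU = np.zeros_like(W), np.zeros_like(U)
    dh = hs[T] - y                          # dL/dh_T
    for t in range(T, max(T - k, 0), -1):   # stop after k steps back
        dpre = dh * (1.0 - hs[t] ** 2)      # back through tanh
        dW += np.outer(dpre, hs[t - 1])
        dU += np.outer(dpre, xs[t - 1])
        dh = W.T @ dpre                     # carry gradient one step further back
    return dW, dU

rng = np.random.default_rng(0)
W, U = rng.normal(size=(4, 4)) * 0.5, rng.normal(size=(4, 3)) * 0.5
xs = [rng.normal(size=3) for _ in range(5)]
y = rng.normal(size=4)
dW_trunc, dU_trunc = truncated_bptt_grads(xs, y, W, U, k=2)  # clipped to 2 steps
```

Setting k to the full sequence length recovers ordinary (untruncated) BPTT.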
(Deep Learning book)

The gradients on the various parameters then become: [equations shown as a figure, not captured in this transcript]. However, gradient descent suffers from the same vanishing gradient problem as feed-forward networks.
Global optimization methods: "Training the weights in a neural network can be modeled as a non-linear global optimization problem. Arbitrary global optimization techniques may then be used to minimize this target function. The most common global optimization method for training RNNs is genetic algorithms." [Wikipedia]
3 Bidirectional RNNs

Bidirectional RNNs (BRNNs) combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence. By using the two time directions, input information from both the past and the future of the current time frame can be used, unlike a standard RNN, which requires delays to include future information. BRNNs can be trained using algorithms similar to those for RNNs, because the neurons in the two directions do not interact.
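The two independent passes can be sketched in numpy (illustrative weight names): one RNN reads left-to-right, the other right-to-left, and their hidden states are concatenated at each position.

```python
import numpy as np

def birnn_forward(xs, Wf, Uf, Wb, Ub):
    """Bidirectional RNN: run one RNN forward in time and an independent
    RNN backward in time, then concatenate the two hidden states at each
    position. The two directions never interact during the passes."""
    n = Wf.shape[0]
    hf, hb = np.zeros(n), np.zeros(n)
    fwd, bwd = [], []
    for x in xs:                          # left-to-right pass
        hf = np.tanh(Wf @ hf + Uf @ x)
        fwd.append(hf)
    for x in reversed(xs):                # right-to-left pass
        hb = np.tanh(Wb @ hb + Ub @ x)
        bwd.append(hb)
    bwd.reverse()                         # realign with the original order
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
Wf, Wb = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
Uf, Ub = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
hs = birnn_forward([rng.normal(size=3) for _ in range(4)], Wf, Uf, Wb, Ub)
```

Each position's combined state thus sees both its past (forward pass) and its future (backward pass).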
4 Encoder-Decoder NNs

Generally speaking, encoder-decoder networks learn a mapping from an input sequence to an output sequence. With multilayer feed-forward networks, such networks are called 'auto-associators': the input and output could be the same (to learn the identity function -> compression) or different (e.g., classification with a one-hot output representation).
With recurrent networks, an encoder-decoder architecture operates on a sequence, NOT a single unit/neuron, as the input/output unit: there is one RNN for encoding and another RNN for decoding. A hidden state connected from the end of the encoder essentially represents the context (a variable C), or a semantic summary, of the input sequence, and it is connected to the decoder RNN.
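The encoder/decoder split can be sketched as two RNNs joined only by the context vector C (illustrative numpy; weight names and the choice to initialize the decoder state from C are assumptions):

```python
import numpy as np

def encode(xs, W_enc, U_enc):
    """Encoder RNN: the final hidden state serves as the context vector C,
    a fixed-size semantic summary of the whole input sequence."""
    h = np.zeros(W_enc.shape[0])
    for x in xs:
        h = np.tanh(W_enc @ h + U_enc @ x)
    return h

def decode_seq(C, W_dec, V, steps):
    """Decoder RNN: conditioned on the context C (here, by using C as the
    initial hidden state) and unrolled for a chosen number of steps."""
    h = C.copy()
    ys = []
    for _ in range(steps):
        h = np.tanh(W_dec @ h)
        ys.append(V @ h)
    return ys

rng = np.random.default_rng(0)
W_enc, U_enc = rng.normal(size=(5, 5)), rng.normal(size=(5, 3))
W_dec, V = rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
C = encode([rng.normal(size=3) for _ in range(4)], W_enc, U_enc)
ys = decode_seq(C, W_dec, V, steps=3)
```

Note that the input and output sequences may have different lengths, since only C passes between the two RNNs.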
5 Deep RNNs

RNNs can be made into deep networks in many ways: for example, by stacking recurrent layers so that the hidden-state sequence of one layer becomes the input sequence of the next, or by adding depth to the input-to-hidden, hidden-to-hidden, or hidden-to-output computations.
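The stacking variant can be sketched in numpy (illustrative weight names; each layer is an ordinary RNN whose state sequence feeds the layer above):

```python
import numpy as np

def stacked_rnn(xs, layer_weights):
    """A deep (stacked) RNN: the hidden-state sequence produced by each
    layer becomes the input sequence of the layer above it."""
    seq = xs
    for W, U in layer_weights:            # one (W, U) pair per layer
        h = np.zeros(W.shape[0])
        out = []
        for x in seq:
            h = np.tanh(W @ h + U @ x)    # ordinary recurrent update
            out.append(h)
        seq = out                         # feed this layer's states upward
    return seq

rng = np.random.default_rng(0)
weights = [(rng.normal(size=(5, 5)), rng.normal(size=(5, 3))),   # layer 1: 3 -> 5
           (rng.normal(size=(4, 4)), rng.normal(size=(4, 5)))]   # layer 2: 5 -> 4
hs = stacked_rnn([rng.normal(size=3) for _ in range(6)], weights)
```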
6 Recursive NNs

A recursive neural network is a kind of deep neural network created by applying the same set of weights recursively over a structured input. "In the most simple architecture, nodes are combined into parents using a weight matrix that is shared across the whole network, and a non-linearity such as tanh." [Wikipedia]
If c1 and c2 are n-dimensional vector representations of nodes, their parent will also be an n-dimensional vector, calculated as p = tanh(W [c1; c2]), where W is an n x 2n matrix.

Training: typically, stochastic gradient descent (SGD) is used to train the network. The gradient is computed using backpropagation through structure (BPTS), a variant of backpropagation through time used for recurrent neural networks. [Wikipedia]
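The parent computation above can be sketched directly in numpy (the bias term b is an assumption; some formulations omit it). The same W is reused at every internal node of the tree:

```python
import numpy as np

def parent(c1, c2, W, b):
    """Combine two n-dimensional child vectors into an n-dimensional parent:
    p = tanh(W [c1; c2] + b), with W of shape (n, 2n) shared across the tree."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

rng = np.random.default_rng(0)
n = 4
W, b = rng.normal(size=(n, 2 * n)), np.zeros(n)
c1, c2 = rng.normal(size=n), rng.normal(size=n)
p = parent(c1, c2, W, b)                     # parent of two leaf nodes
root = parent(p, rng.normal(size=n), W, b)   # the SAME W is reused up the tree
```

Because parent vectors have the same dimension as children, the rule can be applied recursively over any binary tree structure.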
7 Long Short-Term Memory (LSTM)

The idea behind RNNs is to incorporate dependencies -- information from earlier in the input sequence -- as the context or memory used in processing the current input. Long Short-Term Memory (LSTM) networks are a special kind of RNN, capable of learning long-term/long-distance dependencies. An LSTM network consists of LSTM units. A common LSTM unit is composed of a context/state cell, an input gate, an output gate, and a forget gate. "The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell." [Wikipedia]
(http://localhost:8888/notebooks/Temp-Heaton/t81_558_class10_lstm.ipynb)

The Big Picture: an LSTM maintains an internal state and produces an output. The following diagram shows an LSTM unit over three time slices: the current time slice (t), as well as the previous (t-1) and next (t+1) slices. C is the context value; both the output and the context values are always fed to the next time slice.
The sigmoid layers output numbers between 0 and 1 that determine how much of each component should be let through. The pink X is a point-wise multiplication gate.
(http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Step-by-step walk-through:

(1) The forget gate f_t controls how much of the information coming from h_{t-1} and the new input x_t to keep (producing values between 0 and 1 via the sigmoid), for each internal state cell at time t.
(2) The input gate i_t applies the sigmoid to control which values to update at time t. Then tanh is applied to create the draft of the new context, C~_t.
(3) The old state C_{t-1} is multiplied by f_t, to forget the things in the previous context that we decided to forget earlier. Then we add i_t * C~_t, the new candidate values scaled by how much we decided to update each state value: C_t = f_t * C_{t-1} + i_t * C~_t.
(4) We also decide which information to output, by filtering through the output gate o_t. The hidden value for this time t is the output gate value multiplied by the tanh of the new context: h_t = o_t * tanh(C_t).
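The four steps above can be sketched as one LSTM time step in numpy (the weight layout, with each gate's matrix acting on the concatenation [h_{t-1}; x_t], is a common convention, assumed here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM time step, following the four-step walk-through above.
    Each W* has shape (n_hidden, n_hidden + n_input)."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)          # (1) forget gate
    i = sigmoid(Wi @ z + bi)          # (2) input gate
    C_tilde = np.tanh(Wc @ z + bc)    #     draft of the new context
    C = f * C_prev + i * C_tilde      # (3) forget old info, add scaled new info
    o = sigmoid(Wo @ z + bo)          # (4) output gate
    h = o * np.tanh(C)                #     new hidden/output value
    return h, C

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
Wf, Wi, Wo, Wc = (rng.normal(size=(n_h, n_h + n_in)) * 0.5 for _ in range(4))
bf = bi = bo = bc = np.zeros(n_h)
h, C = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h),
                 Wf, Wi, Wo, Wc, bf, bi, bo, bc)
```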
RNN vs LSTM: [comparison figure]
Gated Recurrent Unit (GRU)
A GRU also takes x_t and h_{t-1} as inputs. It performs some calculations and then passes along h_t. What makes GRUs different from LSTMs is that they do not need a separate cell state to pass values along. The calculations within each iteration ensure that the h_t values being passed along either retain a high amount of old information or are jump-started with a high amount of new information.
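One GRU step can be sketched in numpy, following the common (colah-style) formulation with update and reset gates; the weight layout is an assumption, as above for the LSTM:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU time step. Each W* has shape (n_hidden, n_hidden + n_input).
    Note there is no separate cell state: only h is passed along."""
    zc = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ zc + bz)                    # update gate
    r = sigmoid(Wr @ zc + br)                    # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]) + bh)
    return (1 - z) * h_prev + z * h_tilde        # blend old and new information

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
Wz, Wr, Wh = (rng.normal(size=(n_h, n_h + n_in)) * 0.5 for _ in range(3))
bz = br = bh = np.zeros(n_h)
h = gru_step(rng.normal(size=n_in), np.zeros(n_h), Wz, Wr, Wh, bz, br, bh)
```

The final line shows the "retain old vs. jump-start with new" trade-off directly: z close to 0 keeps h_{t-1}, z close to 1 replaces it with the candidate h_tilde.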
8 LSTM Code Example

(https://www.tensorflow.org/guide/keras/rnn)

Here is a simple example of a Sequential model that processes sequences of integers, embeds each integer into a 64-dimensional vector, then processes the sequence of vectors with an LSTM into 10 target categories.
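The slide's code itself is not captured in this transcript; a sketch matching the description (and the example in the cited TensorFlow guide) looks like the following. The vocabulary size of 1000 is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # embed each integer (vocabulary assumed to be 1000) into a 64-dim vector
    layers.Embedding(input_dim=1000, output_dim=64),
    # process the sequence of 64-dim vectors with 128 LSTM units
    layers.LSTM(128),
    # project the final LSTM output onto the 10 target categories
    layers.Dense(10),
])

# a batch of 2 integer sequences of length 7 -> 2 vectors of 10 logits
out = model(tf.zeros((2, 7), dtype=tf.int32))
```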
The number of parameters in the lstm layer is 98,816 because:
- Each internal gate has 128 units, so each gate is a vector of length 128.
- To compute each internal gate, (1) the values from the previous step (h_{t-1}, 128 values) are concatenated with the input values (x_t, 64 values), giving 192 values, which are connected to (2) all of the units in the gate (128 values), plus (3) a bias (128 values): (192 * 128) + 128 = 24,704.
- There are four such gates, so the total number of parameters is 24,704 * 4 = 98,816.
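The arithmetic above can be checked directly:

```python
# Parameter count of an LSTM layer with 128 units on 64-dimensional inputs
n_input, n_units = 64, 128
per_gate = (n_units + n_input) * n_units + n_units   # weights on [h; x] plus bias
total = 4 * per_gate                                 # four gates share this shape
```

Here per_gate is 24,704 and total is 98,816, matching the layer summary.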
In the next example, the data has 3 input variables (a 'lookback' of 3 time steps) and 1 output variable, and 4 (hidden) LSTM units are chosen (for each time step/slice). Note that the task here is regression -- the activation function of the output layer (just one node) is linear by default.

Another example: https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
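A minimal Keras sketch matching that description (the exact slide code is not captured; the random data here is only to make the snippet runnable):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# lookback of 3 time steps, 1 feature per step, 4 LSTM units, 1 output node
model = tf.keras.Sequential([
    layers.LSTM(4, input_shape=(3, 1)),
    layers.Dense(1),  # activation is linear by default -> regression
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(16, 3, 1)   # 16 samples, each a length-3 sequence
y = np.random.rand(16, 1)
model.fit(X, y, epochs=1, verbose=0)
pred = model.predict(X, verbose=0)
```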