Recurrent Neural Networks
Recurrent Networks
Some problems require previous history/context in order to be able to give proper output (speech recognition, stock forecasting, target tracking, etc.)
One way to do that is to just provide all the necessary context in one "snapshot" and use standard learning
How big should the snapshot be? It varies for different instances of the problem.
Recurrent Networks
Another option is to use a recurrent neural network, which lets the network dynamically learn how much context it needs in order to solve the problem
Speech example – vowels vs. consonants, etc.
Acts like a state machine that will give different outputs for the current input depending on the current state
Recurrent nets must learn and use this state/context information in order to get high accuracy on the task
Temporal deep network
Recurrent Training Data
Time Series
Current target dependent on some combination of current and past inputs
time    x     Target
1       x1    y1
2       x2    y2
3       x3    y3
4       x4    y4
5       x5    y5
6       x6    y6
7       x7    y7
Recurrent Networks
Partially and fully recurrent networks – feedforward vs. relaxation nets
Parameter sharing and arbitrary length stream – see figure
How to train?
Elman training – simple recurrent networks, can use standard BP training
BPTT – backpropagation through time – can learn further back, must pick depth
Real-time recurrent learning, LSTM, etc.
Recurrent Network Variations
This network can theoretically learn contexts arbitrarily far back
Many structural variations:
Elman/Simple Net
Jordan Net
Mixed
Context sub-blocks, etc.
Multiple hidden/context layers, etc.
How do we learn the weights?
Simple Recurrent Training – Elman Training
Can think of the net as just being a normal MLP structure where part of the input happens to be a copy of the last set of state/hidden node activations. The MLP itself does not even need to be aware that the context inputs are coming from the hidden layer.
Then can train with standard BP training
While the network can theoretically look back arbitrarily far in time, the Elman learning gradient goes back only 1 step in time, and is thus limited in the context it can learn
What if the current output depended on the input 2 time steps back?
Can still be useful for applications with short-term dependencies
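As a concrete illustration, here is a minimal numpy sketch of an Elman-style net (a sketch, not any particular library's API): the context units are just a copy of the previous hidden activations fed in as extra inputs, and training is ordinary BP, so the gradient stops at the copy and only looks back one step. All names (ElmanNet, W_h, W_o) are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ElmanNet:
    """Simple recurrent (Elman) net: the context units hold a copy of the
    previous hidden activations and are treated as ordinary extra inputs."""
    def __init__(self, n_in, n_hidden, n_out, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.normal(0, 0.1, (n_hidden, n_in + n_hidden + 1))  # +1 for bias
        self.W_o = rng.normal(0, 0.1, (n_out, n_hidden + 1))
        self.context = np.zeros(n_hidden)        # h(t-1), initialized to 0
        self.lr = lr

    def step(self, x, target=None):
        # Forward pass: concatenate the real input with the copied context
        z = np.concatenate([x, self.context, [1.0]])
        h = sigmoid(self.W_h @ z)
        o = sigmoid(self.W_o @ np.concatenate([h, [1.0]]))
        if target is not None:
            # Standard BP with squared error; the gradient stops at the context copy
            d_o = (target - o) * o * (1 - o)
            d_h = (self.W_o[:, :-1].T @ d_o) * h * (1 - h)
            self.W_o += self.lr * np.outer(d_o, np.concatenate([h, [1.0]]))
            self.W_h += self.lr * np.outer(d_h, z)
        self.context = h                          # copy hidden state for the next step
        return o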
BPTT – Backprop Through Time
BPTT allows us to look back further as we train
However we have to pre-specify a value k, which is the maximum that learning will look back
During training we unfold the network in time as if it were a standard feedforward network with k layers, but where the weights of each unfolded layer are the same
We then train the unfolded k-layer feedforward net with standard BP (or deep net variations) – a code sketch follows at the end of this slide
Execution still happens with the actual recurrent version
Is not knowing k a priori that bad? How do you choose it?
Cross validation, just like finding the best number of hidden nodes, etc., thus we can find a good k fairly reasonably for a given task
But problematic if the amount of state needed varies a lot
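A minimal numpy sketch of one BPTT update, following the conventions described later in this deck: h(0) is initialized to 0, error is injected only at the k-th target, and the weight changes from the unfolded copies are accumulated and applied once so every copy stays an exact copy. Function and variable names are illustrative, not from any library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bptt_step(W_h, W_o, xs, target, lr=0.1):
    """One BPTT update over the k inputs in xs (k = len(xs))."""
    k = len(xs)
    n_hidden = W_o.shape[1] - 1
    h_prev = np.zeros(n_hidden)                  # initial context h(0)
    zs, hs = [], []
    for x in xs:                                 # forward through the k unfolded layers
        z = np.concatenate([x, h_prev, [1.0]])
        h = sigmoid(W_h @ z)
        zs.append(z); hs.append(h)
        h_prev = h
    o = sigmoid(W_o @ np.concatenate([hs[-1], [1.0]]))

    # Backward pass: error injected only at the k-th (final) output
    d_o = (target - o) * o * (1 - o)
    dW_o = np.outer(d_o, np.concatenate([hs[-1], [1.0]]))
    d_h = (W_o[:, :-1].T @ d_o) * hs[-1] * (1 - hs[-1])
    dW_h = np.zeros_like(W_h)
    for t in range(k - 1, -1, -1):               # walk back through the unfolded copies
        dW_h += np.outer(d_h, zs[t])
        if t > 0:                                # pass the gradient to the previous copy
            n_in = len(zs[t]) - n_hidden - 1
            d_h = (W_h[:, n_in:n_in + n_hidden].T @ d_h) * hs[t - 1] * (1 - hs[t - 1])
    W_h += lr * dW_h / k                         # one accumulated/averaged update
    W_o += lr * dW_o
    return o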
k is the number of feedback/context blocks in the unfolded net.
Note k=1 is just a standard MLP with no feedback
The 1st block's h(0) activations are just initialized to a constant or 0, so k=1 is still the same as a standard MLP; just leave it out for a feedforward MLP
The last context block is h(k-1)
k=2 is Elman training
BPTT – Unfolding in Time (k=3) with output connections
Weights at each layer are maintained as exact copies
[Figure: the unfolded network – each of the k copies has its own Input, Hidden, Context, and Output block; each Context block receives the previous copy's Hidden activations through a one-step time delay, and the first copy starts from the Initial Context]
Synthetic Data Set
Delayed Parity Task – Dparity
This task has a single time-series input of random bits. The output label is the (even) parity of n arbitrarily delayed (but consistent) previous inputs. For example, for Dparity(0,2,5) the label of each instance would be set to the parity of the current input, the input 2 steps back, and the input 5 steps back.
Dparity(0,1) is the simplest version, where the output is the XOR of the current input and the most recent previous input
Dparity-to-ARFF app
User enters the # of instances wanted, a random seed, and a vector of the n delays (and optionally a noise parameter?)
The app returns an ARFF file of this task with a random input stream based on the seed, with proper labels
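A hedged sketch of the labeling logic (not the actual Dparity-to-ARFF app; writing the ARFF file is omitted, and how labels are defined for the first few steps, before all delayed inputs exist, is an assumption here):

import numpy as np

def dparity_stream(n_instances, delays, seed=0):
    """Random bit stream plus delayed-parity labels.
    delays: e.g. (0, 2, 5) -> label(t) = parity of x(t), x(t-2), x(t-5)."""
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, size=n_instances)
    labels = np.zeros(n_instances, dtype=int)
    for t in range(n_instances):
        vals = [bits[t - d] for d in delays if t - d >= 0]
        labels[t] = sum(vals) % 2                # XOR/parity of the delayed bits
    return bits, labels

# Dparity(0,1): each label is the XOR of the current and previous input bits
xs, ys = dparity_stream(10, (0, 1), seed=42)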
BPTT Learning/Execution
Consider Dparity(0,1) and Dparity(0,2)
For Dparity(0,1) what would k need to be?
For learning and execution we need to start the input stream at least k steps back to get reasonable context
How do you fill in the initial activations of the context nodes?
0 vector is common; .5 vector, or a typical/average vector
For Dparity(0,2) what would k need to be?
Note k=1 is just standard non-feedback BP
And k=2 is simple Elman training looking back one step
Let's do an example and walk through it – HW (a worked trace follows below)
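A small worked trace for Dparity(0,1), consistent with the hints on the later project slides: for the input stream x = 1, 0, 1, 1, 0 the targets are y = –, 1, 1, 0, 1, since y(t) = x(t) XOR x(t-1). Because each target depends on the input one step back, the unfolded net must include that previous input, so k must be at least 2 (Elman-style lookback of one step); for Dparity(0,2) the target reaches two steps back, so k must be at least 3 (consistent with the k ≥ z+1 rule on the hints slides).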
BPTT Training Example/Notes
How to select instances from the training set:
Random start positions
Input and process for k steps (could start a few steps further back to get a more representative example of the initial context node activations – burn-in)
Use the kth label as the target
Any advantage in starting the next sequence at the last start + 1?
Would already have approximations for the initial context activations
Don't shuffle the training set (targets of the first k-1 instances are ignored)
Unfold and propagate error for the k layers
Backpropagate error starting just from the kth target – else hidden node weight updates would be dominated by earlier, less attenuated target errors
Accumulate the weight changes and make one update at the end with the average – thus all unfolded weights are proper exact copies (a training-loop sketch follows below)
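A sketch of this instance-selection loop, reusing the bptt_step() sketch from the unfolding slide (burn-in is omitted; it would simply prepend a few forward-only steps before the k-step window). Names are illustrative.

import numpy as np

def train_bptt(xs, ys, W_h, W_o, k, n_updates, lr=0.1, seed=0):
    """Each update: pick a random start position, feed k consecutive inputs,
    and use only the k-th label as the target (earlier targets are ignored)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        start = int(rng.integers(0, len(xs) - k + 1))
        window = [np.atleast_1d(x) for x in xs[start:start + k]]
        target = np.atleast_1d(ys[start + k - 1])
        bptt_step(W_h, W_o, window, target, lr=lr)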
BPTT Issues/Notes
Typically an exponential drop-off in effect of prior inputs – only so much that a few context nodes can be expected to remember
Error attenuation issues of multi-layer BP learning as k gets larger (will discuss the vanishing gradient more later)
Can use all the recent deep learning tricks for that: ReLU, etc.
Learning less stable and more difficult to get good results, local optima more common with recurrent nets
BPTT – common approach; finding the proper depth k is important
Former BPTT Project
Implement BPTT
Experiment with the Delayed Parity Task
First test with Dparity(0,1) to make sure that works. Then try other variations, including ones which stress BPTT.
Analyze the results of learning a real-world recurrent task of your choice
BPTT Project
Sequential/time-series data, with and without separate labels
These series often do not have separate labels
Recurrent nets can support both variations
Possibilities in the Irvine Data Repository
Detailed example – the Localization Data for Person Activity data set – let's set this one up exactly – some subtleties
Which features should we use as inputs?
Localization Example
Time stamps are not that regular
Thus just one sensor reading per time stamp
Could try to separate out learning of one sensor at a time, but the combination of sensors is critical, and just keeping the examples in temporal order should be sufficient for learning
What would the network structure and data representation look like?
What value for k? Typical CV graph?
Stopping criteria (e.g. validation set, etc.)
Remember basic BP issues: normalization, nominal value encoding, don't know values, etc.
Localization Example
Note that you might think that there would be a synchronized time stamp showing the x,y,z coordinates for each of the 4 sensors – in which case the feature vector would look like what?
And could then do k ≈ 3 vs. k ≈ 10 for the current version (and k ≈ 10 will struggle due to error attenuation with a vanilla RNN) – a hypothetical encoding sketch follows below
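One hypothetical encoding of a single reading for the per-time-stamp version (one sensor reading per step): one-hot the tag/sensor id as a nominal feature and min-max normalize the x,y,z coordinates, giving 4 + 3 = 7 inputs per step. The sensor names below are illustrative placeholders, not the data set's exact field values.

import numpy as np

SENSORS = ["ankle_left", "ankle_right", "belt", "chest"]   # assumed names for the 4 tags

def encode_reading(sensor, x, y, z, coord_min, coord_max):
    """One-hot the sensor id and min-max normalize the coordinates."""
    one_hot = np.array([1.0 if sensor == s else 0.0 for s in SENSORS])
    coords = (np.array([x, y, z]) - coord_min) / (coord_max - coord_min)
    return np.concatenate([one_hot, coords])                # 7-value input vector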
BPTT Project Hints
Dparity(0,1)
Needs LOTS of weight updates (e.g. 20 epochs with 10,000 instances, 10 epochs with 20,000 instances, 1 epoch with 10^6, etc.)
Learning can be negligible for a long time, and then suddenly rise to 100% accuracy
k must be at least 2
Larger k should just slow things down and could lead to overfit if there were noise in the training data; shouldn't for Dparity, but could add noise
Need enough hidden nodes
Struggles to learn with fewer than 4 unless lots of data; 4 or more does well
More hidden nodes can bring down epochs, but may still increase wall clock time (i.e. # of weight updates)
Not all hidden nodes need to be state nodes
Explore a bit
BPTT Project Hints
Dparity(x,y,z)
Will get 100% accuracy; more weight updates needed
For example DP(0,2,3): 16-32 hidden nodes, 10^6 training samples in the data set (1 epoch), but much less can also work; z can be larger
k must be at least z+1, try different values
Burn-in helpful? – not necessary in Dparity
Need enough hidden nodes; more can be helpful, but too many can slow things down
LR around .5 seems to work well
Momentum (e.g. .9) also speeds things up
Use a fast computer language/system!
BPTT Project Hints
Real world task
Unlike Dparity(), the recurrence requirement for different instances may vary
Sometimes may need to look back 4-5 steps
Other times may not need to look back at all
Thus, first train with k=1 (standard BP) as the baseline, and then you can see how much improvement is obtained when using recurrence (see the sketch below)
Then try k = 2, 3, 4, etc.
Too big of a k (e.g. > 10) will usually take too long to see any benefits, since the error is too attenuated to gain much benefit
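A trivial sketch of that search over k; train_fn(k) and eval_fn(model) are placeholders for whatever training routine and validation-set accuracy measure you use.

def pick_k(train_fn, eval_fn, ks=(1, 2, 3, 4, 5)):
    """Start with k=1 (standard BP baseline) and keep the k with the best
    validation accuracy."""
    best_k, best_acc = None, -1.0
    for k in ks:
        model = train_fn(k)          # e.g. BPTT training with unfolding depth k
        acc = eval_fn(model)         # accuracy on a held-out validation set
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc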
Dealing with the vanishing/exploding gradient in RNNs
Gradient clipping – for large gradients – a type of adaptive LR (see the sketch below)
Linear self-connection near one for the gradient – leaky unit
Skip connections
Make sure a unit can be influenced by units d skips back; still limited by the amount of skipping, etc.
Time delays and different time scales
LSTM – Long Short-Term Memory – current state of the art
GRU – Gated Recurrent Unit – an LSTM variant
Keeps a self loop to maintain state and gradient constant as long as needed – the self loop is gated by another learning node – the forget gate
Learns when to use and forget the state
Brief peek here; we'll talk more about LSTM with deep networks
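A minimal sketch of gradient clipping by norm (assuming the gradient has been flattened into a single vector); rescaling when the norm is too large acts like an adaptive learning rate for large gradients.

import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale the gradient vector when its norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad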
LSTM/GRU Peek Ahead
Long Short-Term Memory/Gated Recurrent Unit
Pictures from Olah's Blog
We have been adding a layer of weights between h_t and o_t
Trained with BPTT, but since it handles the long-term attenuation issues, k can be much larger
Sentence length, utterance size, 100, or some arbitrary chunk value, etc.
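For reference, a minimal numpy sketch of one LSTM cell step in the standard formulation (as diagrammed in Olah's blog); the stacked-list weight layout (four gate matrices and biases) is an illustrative choice.

import numpy as np

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W and b each hold four entries: forget, input, candidate, output."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = np.concatenate([h_prev, x])
    f = sigmoid(W[0] @ z + b[0])     # forget gate: how much old cell state to keep
    i = sigmoid(W[1] @ z + b[1])     # input ("ignore") gate: how much new info to write
    g = np.tanh(W[2] @ z + b[2])     # candidate cell state
    o = sigmoid(W[3] @ z + b[3])     # output gate
    c = f * c_prev + i * g           # the self loop that preserves state/gradient
    h = o * np.tanh(c)               # exposed hidden state h(t)
    return h, c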
Other Recurrent Approaches
LSTM – (GRU is an LSTM subset) – state of the art, look closer later
An RNN node is basically LSTM without the forget, ignore, and output gates, just g
Train with BPTT but bigger k's (a full sequence if not too large), or some pretty big chunk (25-100), since we avoid the vanishing gradient
RTRL – Real-Time Recurrent Learning
Do not have to specify a k; will look arbitrarily far back
But note that with an expectation of looking arbitrarily far back, you create a very difficult learning problem
Looking back more requires an increase in data, else overfit – lots of irrelevant options which could lead to minor accuracy improvements
Have reasonable expectations
n^4 and n^3 versions – expensive and not used much in practice
Recursive network – dynamic tree structures
Reservoir computing: Echo State Networks and Liquid State Machines
Neural Turing Machine – an RNN which can learn to read/write memory
Relaxation networks – Hopfield, Boltzmann, Multcons, etc.