Recurrent Neural Networks
Recurrent Networks
Some problems require previous history/context in order to be able to give proper output (speech recognition, stock forecasting, target tracking, etc.)
One way to do that is to just provide all the necessary context in one "snapshot" and use standard learning
How big should the snapshot be? It varies for different instances of the problem.
Recurrent Networks
Another option is to use a recurrent neural network, which lets the network dynamically learn how much context it needs in order to solve the problem
Speech example – vowels vs. consonants, etc.
Acts like a state machine that will give different outputs for the current input depending on the current state
Recurrent nets must learn and use this state/context information in order to get high accuracy on the task
Temporal deep network
Recurrent Training Data
Time Series
Current target dependent on some combination of current and past inputs
time    x     Target
1       x1    y1
2       x2    y2
3       x3    y3
4       x4    y4
5       x5    y5
6       x6    y6
7       x7    y7
Recurrent Networks
Partially and fully recurrent networks – feedforward vs. relaxation nets
Parameter sharing and arbitrary length stream – see figure
How to train?
Elman training – simple recurrent networks, can use standard BP training
BPTT – backpropagation through time – can learn further back, must pick depth
Real-time recurrent learning, LSTM, etc.
Recurrent Network Variations
This network can theoretically learn contexts arbitrarily far back
Many structural variations:
Elman/Simple Net
Jordan Net
Mixed
Context sub-blocks, etc.
Multiple hidden/context layers, etc.
How do we learn the weights?
Simple Recurrent Training – Elman Training
Can think of the net as just being a normal MLP structure where part of the input happens to be a copy of the last set of state/hidden node activations. The MLP itself does not even need to be aware that the context inputs are coming from the hidden layer.
Then can train with standard BP training
While the network can theoretically look back arbitrarily far in time, the Elman learning gradient goes back only 1 step in time, and is thus limited in the context it can learn
What if the current output depended on the input 2 time steps back?
Can still be useful for applications with short-term dependencies
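As a concrete illustration, here is a minimal numpy sketch of an Elman-style net (a sketch, not any particular library's API): the context units are just a copy of the previous hidden activations fed in as extra inputs, and training is ordinary BP, so the gradient stops at the copy and only looks back one step. All names (ElmanNet, W_h, W_o) are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ElmanNet:
    """Simple recurrent (Elman) net: the context units hold a copy of the
    previous hidden activations and are treated as ordinary extra inputs."""
    def __init__(self, n_in, n_hidden, n_out, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.normal(0, 0.1, (n_hidden, n_in + n_hidden + 1))  # +1 for bias
        self.W_o = rng.normal(0, 0.1, (n_out, n_hidden + 1))
        self.context = np.zeros(n_hidden)        # h(t-1), initialized to 0
        self.lr = lr

    def step(self, x, target=None):
        # Forward pass: concatenate the real input with the copied context
        z = np.concatenate([x, self.context, [1.0]])
        h = sigmoid(self.W_h @ z)
        o = sigmoid(self.W_o @ np.concatenate([h, [1.0]]))
        if target is not None:
            # Standard BP with squared error; the gradient stops at the context copy
            d_o = (target - o) * o * (1 - o)
            d_h = (self.W_o[:, :-1].T @ d_o) * h * (1 - h)
            self.W_o += self.lr * np.outer(d_o, np.concatenate([h, [1.0]]))
            self.W_h += self.lr * np.outer(d_h, z)
        self.context = h                          # copy hidden state for the next step
        return o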
BPTT – Backprop Through Time
BPTT allows us to look back further as we train
However we have to pre-specify a value k, which is the maximum that learning will look back
During training we unfold the network in time as if it were a standard feedforward network with k layers, but where the weights of each unfolded layer are the same
We then train the unfolded k-layer feedforward net with standard BP (or deep net variations) – a code sketch follows at the end of this slide
Execution still happens with the actual recurrent version
Is not knowing k a priori that bad? How do you choose it?
Cross validation, just like finding the best number of hidden nodes, etc., thus we can find a good k fairly reasonably for a given task
But problematic if the amount of state needed varies a lot
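A minimal numpy sketch of one BPTT update, following the conventions described later in this deck: h(0) is initialized to 0, error is injected only at the k-th target, and the weight changes from the unfolded copies are accumulated and applied once so every copy stays an exact copy. Function and variable names are illustrative, not from any library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bptt_step(W_h, W_o, xs, target, lr=0.1):
    """One BPTT update over the k inputs in xs (k = len(xs))."""
    k = len(xs)
    n_hidden = W_o.shape[1] - 1
    h_prev = np.zeros(n_hidden)                  # initial context h(0)
    zs, hs = [], []
    for x in xs:                                 # forward through the k unfolded layers
        z = np.concatenate([x, h_prev, [1.0]])
        h = sigmoid(W_h @ z)
        zs.append(z); hs.append(h)
        h_prev = h
    o = sigmoid(W_o @ np.concatenate([hs[-1], [1.0]]))

    # Backward pass: error injected only at the k-th (final) output
    d_o = (target - o) * o * (1 - o)
    dW_o = np.outer(d_o, np.concatenate([hs[-1], [1.0]]))
    d_h = (W_o[:, :-1].T @ d_o) * hs[-1] * (1 - hs[-1])
    dW_h = np.zeros_like(W_h)
    for t in range(k - 1, -1, -1):               # walk back through the unfolded copies
        dW_h += np.outer(d_h, zs[t])
        if t > 0:                                # pass the gradient to the previous copy
            n_in = len(zs[t]) - n_hidden - 1
            d_h = (W_h[:, n_in:n_in + n_hidden].T @ d_h) * hs[t - 1] * (1 - hs[t - 1])
    W_h += lr * dW_h / k                         # one accumulated/averaged update
    W_o += lr * dW_o
    return o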
k is the number of feedback/context blocks in the unfolded net.
Note k=1 is just a standard MLP with no feedback
The 1st block's h(0) activations are just initialized to a constant or 0, so k=1 is still the same as a standard MLP; just leave it out for a feedforward MLP
The last context block is h(k-1)
k=2 is Elman training
BPTT – Unfolding in Time (k=3) with output connections
Weights at each layer are maintained as exact copies
[Figure: the unfolded network – each of the k copies has its own Input, Hidden, Context, and Output block; each Context block receives the previous copy's Hidden activations through a one-step time delay, and the first copy starts from the Initial Context]
Synthetic Data Set
Delayed Parity Task – Dparity
This task has a single time-series input of random bits. The output label is the (even) parity of n arbitrarily delayed (but consistent) previous inputs. For example, for Dparity(0,2,5) the label of each instance would be set to the parity of the current input, the input 2 steps back, and the input 5 steps back.
Dparity(0,1) is the simplest version, where the output is the XOR of the current input and the most recent previous input
Dparity-to-ARFF app
User enters the # of instances wanted, a random seed, and a vector of the n delays (and optionally a noise parameter?)
The app returns an ARFF file of this task with a random input stream based on the seed, with proper labels
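A hedged sketch of the labeling logic (not the actual Dparity-to-ARFF app; writing the ARFF file is omitted, and how labels are defined for the first few steps, before all delayed inputs exist, is an assumption here):

import numpy as np

def dparity_stream(n_instances, delays, seed=0):
    """Random bit stream plus delayed-parity labels.
    delays: e.g. (0, 2, 5) -> label(t) = parity of x(t), x(t-2), x(t-5)."""
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, size=n_instances)
    labels = np.zeros(n_instances, dtype=int)
    for t in range(n_instances):
        vals = [bits[t - d] for d in delays if t - d >= 0]
        labels[t] = sum(vals) % 2                # XOR/parity of the delayed bits
    return bits, labels

# Dparity(0,1): each label is the XOR of the current and previous input bits
xs, ys = dparity_stream(10, (0, 1), seed=42)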
BPTT Learning/Execution
Consider Dparity(0,1) and Dparity(0,2)
For Dparity(0,1) what would k need to be?
For learning and execution we need to start the input stream at least k steps back to get reasonable context
How do you fill in the initial activations of the context nodes?
0 vector is common; .5 vector, or a typical/average vector
For Dparity(0,2) what would k need to be?
Note k=1 is just standard non-feedback BP
And k=2 is simple Elman training looking back one step
Let's do an example and walk through it – HW (a worked trace follows below)
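A small worked trace for Dparity(0,1), consistent with the hints on the later project slides: for the input stream x = 1, 0, 1, 1, 0 the targets are y = –, 1, 1, 0, 1, since y(t) = x(t) XOR x(t-1). Because each target depends on the input one step back, the unfolded net must include that previous input, so k must be at least 2 (Elman-style lookback of one step); for Dparity(0,2) the target reaches two steps back, so k must be at least 3 (consistent with the k ≥ z+1 rule on the hints slides).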
BPTT Training Example/Notes
How to select instances from the training set:
Random start positions
Input and process for k steps (could start a few steps further back to get a more representative example of the initial context node activations – burn-in)
Use the kth label as the target
Any advantage in starting the next sequence at the last start + 1?
Would already have approximations for the initial context activations
Don't shuffle the training set (targets of the first k-1 instances are ignored)
Unfold and propagate error for the k layers
Backpropagate error starting just from the kth target – else hidden node weight updates would be dominated by earlier, less attenuated target errors
Accumulate the weight changes and make one update at the end with the average – thus all unfolded weights are proper exact copies (a training-loop sketch follows below)
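A sketch of this instance-selection loop, reusing the bptt_step() sketch from the unfolding slide (burn-in is omitted; it would simply prepend a few forward-only steps before the k-step window). Names are illustrative.

import numpy as np

def train_bptt(xs, ys, W_h, W_o, k, n_updates, lr=0.1, seed=0):
    """Each update: pick a random start position, feed k consecutive inputs,
    and use only the k-th label as the target (earlier targets are ignored)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        start = int(rng.integers(0, len(xs) - k + 1))
        window = [np.atleast_1d(x) for x in xs[start:start + k]]
        target = np.atleast_1d(ys[start + k - 1])
        bptt_step(W_h, W_o, window, target, lr=lr)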
BPTT Issues/Notes
Typically an exponential drop-off in effect of prior inputs – only so much that a few context nodes can be expected to remember
Error attenuation issues of multi-layer BP learning as k gets larger (will discuss the vanishing gradient more later)
Can use all the recent deep learning tricks for that: ReLU, etc.
Learning less stable and more difficult to get good results, local optima more common with recurrent nets
BPTT – common approach; finding the proper depth k is important
Former BPTT Project
Implement BPTT
Experiment with the Delayed Parity Task
First test with Dparity(0,1) to make sure that works. Then try other variations, including ones which stress BPTT.
Analyze the results of learning a real-world recurrent task of your choice
BPTT Project
Sequential/time-series data, with and without separate labels
These series often do not have separate labels
Recurrent nets can support both variations
Possibilities in the Irvine Data Repository
Detailed example – the Localization Data for Person Activity data set – let's set this one up exactly – some subtleties
Which features should we use as inputs?
Localization Example
Time stamps are not that regular
Thus just one sensor reading per time stamp
Could try to separate out learning of one sensor at a time, but the combination of sensors is critical, and just keeping the examples in temporal order should be sufficient for learning
What would the network structure and data representation look like?
What value for k? Typical CV graph?
Stopping criteria (e.g. validation set, etc.)
Remember basic BP issues: normalization, nominal value encoding, don't know values, etc.
Localization Example
Note that you might think that there would be a synchronized time stamp showing the x,y,z coordinates for each of the 4 sensors – in which case the feature vector would look like what?
And could then do k ≈ 3 vs. k ≈ 10 for the current version (and k ≈ 10 will struggle due to error attenuation with a vanilla RNN) – a hypothetical encoding sketch follows below
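One hypothetical encoding of a single reading for the per-time-stamp version (one sensor reading per step): one-hot the tag/sensor id as a nominal feature and min-max normalize the x,y,z coordinates, giving 4 + 3 = 7 inputs per step. The sensor names below are illustrative placeholders, not the data set's exact field values.

import numpy as np

SENSORS = ["ankle_left", "ankle_right", "belt", "chest"]   # assumed names for the 4 tags

def encode_reading(sensor, x, y, z, coord_min, coord_max):
    """One-hot the sensor id and min-max normalize the coordinates."""
    one_hot = np.array([1.0 if sensor == s else 0.0 for s in SENSORS])
    coords = (np.array([x, y, z]) - coord_min) / (coord_max - coord_min)
    return np.concatenate([one_hot, coords])                # 7-value input vector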
BPTT Project Hints
Dparity(0,1)
Needs LOTS of weight updates (e.g. 20 epochs with 10,000 instances, 10 epochs with 20,000 instances, 1 epoch with 10^6, etc.)
Learning can be negligible for a long time, and then suddenly rise to 100% accuracy
k must be at least 2
Larger k should just slow things down and could lead to overfit if there were noise in the training data; shouldn't for Dparity, but could add noise
Need enough hidden nodes
Struggles to learn with fewer than 4 unless lots of data; 4 or more does well
More hidden nodes can bring down epochs, but may still increase wall clock time (i.e. # of weight updates)
Not all hidden nodes need to be state nodes
Explore a bit
BPTT Project Hints
Dparity(x,y,z)
Will get 100% accuracy; more weight updates needed
For example DP(0,2,3): 16-32 hidden nodes, 10^6 training samples in the data set (1 epoch), but much less can also work; z can be larger
k must be at least z+1, try different values
Burn-in helpful? – not necessary in Dparity
Need enough hidden nodes; more can be helpful, but too many can slow things down
LR around .5 seems to work well
Momentum (e.g. .9) also speeds things up
Use a fast computer language/system!
BPTT Project Hints
Real world task
Unlike Dparity(), the recurrence requirement for different instances may vary
Sometimes may need to look back 4-5 steps
Other times may not need to look back at all
Thus, first train with k=1 (standard BP) as the baseline, and then you can see how much improvement is obtained when using recurrence (see the sketch below)
Then try k = 2, 3, 4, etc.
Too big of a k (e.g. > 10) will usually take too long to see any benefits, since the error is too attenuated to gain much benefit
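A trivial sketch of that search over k; train_fn(k) and eval_fn(model) are placeholders for whatever training routine and validation-set accuracy measure you use.

def pick_k(train_fn, eval_fn, ks=(1, 2, 3, 4, 5)):
    """Start with k=1 (standard BP baseline) and keep the k with the best
    validation accuracy."""
    best_k, best_acc = None, -1.0
    for k in ks:
        model = train_fn(k)          # e.g. BPTT training with unfolding depth k
        acc = eval_fn(model)         # accuracy on a held-out validation set
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc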
Dealing with the vanishing/exploding gradient in RNNs
Gradient clipping – for large gradients – a type of adaptive LR (see the sketch below)
Linear self-connection near one for the gradient – leaky unit
Skip connections
Make sure a unit can be influenced by units d skips back; still limited by the amount of skipping, etc.
Time delays and different time scales
LSTM – Long Short-Term Memory – current state of the art
GRU – Gated Recurrent Unit – an LSTM variant
Keeps a self loop to maintain state and gradient constant as long as needed – the self loop is gated by another learning node – the forget gate
Learns when to use and forget the state
Brief peek here; we'll talk more about LSTM with deep networks
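A minimal sketch of gradient clipping by norm (assuming the gradient has been flattened into a single vector); rescaling when the norm is too large acts like an adaptive learning rate for large gradients.

import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale the gradient vector when its norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad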
LSTM/GRU Peek Ahead
Long Short-Term Memory/Gated Recurrent Unit
Pictures from Olah's Blog
We have been adding a layer of weights between h_t and o_t
Trained with BPTT, but since it handles the long-term attenuation issues, k can be much larger
Sentence length, utterance size, 100, or some arbitrary chunk value, etc.
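For reference, a minimal numpy sketch of one LSTM cell step in the standard formulation (as diagrammed in Olah's blog); the stacked-list weight layout (four gate matrices and biases) is an illustrative choice.

import numpy as np

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W and b each hold four entries: forget, input, candidate, output."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = np.concatenate([h_prev, x])
    f = sigmoid(W[0] @ z + b[0])     # forget gate: how much old cell state to keep
    i = sigmoid(W[1] @ z + b[1])     # input ("ignore") gate: how much new info to write
    g = np.tanh(W[2] @ z + b[2])     # candidate cell state
    o = sigmoid(W[3] @ z + b[3])     # output gate
    c = f * c_prev + i * g           # the self loop that preserves state/gradient
    h = o * np.tanh(c)               # exposed hidden state h(t)
    return h, c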
Other Recurrent Approaches
LSTM – (GRU is an LSTM subset) – state of the art, look closer later
An RNN node is basically LSTM without the forget, ignore, and output gates, just g
Train with BPTT but bigger k's (a full sequence if not too large), or some pretty big chunk (25-100), since we avoid the vanishing gradient
RTRL – Real-Time Recurrent Learning
Do not have to specify a k; will look arbitrarily far back
But note that with an expectation of looking arbitrarily far back, you create a very difficult learning problem
Looking back more requires an increase in data, else overfit – lots of irrelevant options which could lead to minor accuracy improvements
Have reasonable expectations
n^4 and n^3 versions – expensive and not used much in practice
Recursive network – dynamic tree structures
Reservoir computing: Echo State Networks and Liquid State Machines
Neural Turing Machine – an RNN which can learn to read/write memory
Relaxation networks – Hopfield, Boltzmann, Multcons, etc.