0:00

You've learned about how RNNs work and how they can be applied to problems like named entity recognition, as well as to language modeling, and you saw how backpropagation can be used to train an RNN. It turns out that one of the problems with the basic RNN algorithm is that it runs into vanishing gradient problems. Let's discuss that, and then in the next few videos, we'll talk about some solutions that will help to address this problem.

So, you've seen pictures of RNNs that look like this. Let's take a language modeling example. Say you see this sentence: "The cat, which already ate a bunch of food that was delicious, dot, dot, dot, was full." To be consistent, because "cat" is singular, it should be "the cat was." In the plural case, it would be "were" rather than "was": "The cats, which already ate a bunch of food that was delicious, and apples, and pears, and so on, were full." So to be consistent, it should be "cat was" or "cats were."

This is one example of when language can have very long-term dependencies, where a word from much earlier can affect what needs to come much later in the sentence. But it turns out the basic RNN we've seen so far is not very good at capturing very long-term dependencies.

To explain why, you might remember from our earlier discussions of training very deep neural networks that we talked about the vanishing gradients problem. Say this is a very, very deep neural network, 100 layers or even much deeper. You would carry out forward prop from left to right, and then back prop. And we said that if this is a very deep neural network, then the gradient from the output y would have a very hard time propagating back to affect the weights of the earlier layers, to affect the computations in those earlier layers.

An RNN has a similar problem: forward prop goes from left to right, and then back prop goes from right to left. And it can be quite difficult, because of the same vanishing gradients problem, for the errors associated with the later time steps to affect the computations at earlier time steps. In practice, what this means is that it might be difficult to get a neural network to realize that it needs to memorize whether it just saw a singular noun or a plural noun, so that later on in the sequence it can generate either "was" or "were," depending on whether the subject was singular or plural.

And notice that in English, this stuff in the middle could be arbitrarily long, right? So you might need to memorize the singular/plural for a very long time before you get to use that bit of information. Because of this problem, the basic RNN model has many local influences, meaning that an output like y^<3> is mainly influenced by values close to y^<3>, and a value here is mainly influenced by inputs that are somewhere close. It's difficult for an output late in the sequence to be strongly influenced by an input that was very early in the sequence. And this is because, whatever the output is, whether it got it right or got it wrong, it's just very difficult for the error to backpropagate all the way to the beginning of the sequence, and therefore to modify how the neural network is doing computations earlier in the sequence.
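As a rough scalar sketch of this (my own illustration, not code from the lecture): if you treat the RNN as though it had a single recurrent weight, backprop through time multiplies the gradient by that weight once per time step, so a weight even slightly below 1 shrinks the signal exponentially over a long sequence. The weight value 0.9 and the 100 steps below are made up for illustration:

```python
# Illustrative scalar stand-in for an RNN's recurrent weight.
w = 0.9
grad = 1.0  # gradient signal arriving at the last time step

# Backprop through time: one multiplication by w per time step.
for t in range(100):
    grad *= w

print(grad)  # about 2.7e-05: almost no signal reaches the start
```

With 100 time steps, 0.9**100 is roughly 2.7e-5, so the parameters at the earliest time steps barely get updated.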

So this is a weakness of the basic RNN algorithm, one which we'll address in the next few videos. If we don't address it, then RNNs tend not to be very good at capturing long-range dependencies.

And even though this discussion has focused on vanishing gradients, you might remember that when we talked about very deep neural networks, we also talked about exploding gradients. When doing back prop, the gradients don't just decrease exponentially; they may also increase exponentially with the number of layers you go through. It turns out that vanishing gradients tends to be the bigger problem with training RNNs, although when exploding gradients happen, it can be catastrophic, because the exponentially large gradients can cause your parameters to become so large that your neural network parameters get really messed up.

So it turns out that exploding gradients are easier to spot, because the parameters just blow up and you might often see NaNs, or not-a-numbers, meaning the results of a numerical overflow in your neural network computation.
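Here is a made-up numerical illustration (not the lecture's example) of how an exploding gradient turns into NaNs: a gradient that doubles at every time step overflows 32-bit floats after roughly 128 steps, and once you have an infinity, later arithmetic can produce not-a-number results:

```python
import numpy as np

g = np.float32(1.0)  # gradient at the last time step
with np.errstate(over="ignore", invalid="ignore"):
    for t in range(200):
        g *= np.float32(2.0)  # exploding gradient: doubles each step
    diff = g - g              # inf - inf is nan

print(g)     # inf: the gradient overflowed float32's range
print(diff)  # nan: overflow has corrupted downstream computation
```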

And if you do see exploding gradients, one solution is to apply gradient clipping. All that means is: look at your gradient vectors, and if they're bigger than some threshold, rescale your gradient vectors so that they're not too big. So they're clipped according to some maximum value. If your derivatives do explode or you see NaNs, just apply gradient clipping, and that's a relatively robust solution that will take care of exploding gradients.
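A minimal sketch of clipping by global norm, assuming the gradients are plain NumPy arrays (the threshold of 5 and the helper name are illustrative, not from the lecture):

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their global norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# An exploded gradient gets rescaled down to the threshold norm;
# a small gradient passes through unchanged.
big = [np.full(10, 1e6)]
small = [np.full(10, 0.1)]
print(np.linalg.norm(clip_by_norm(big)[0]))    # ~5.0
print(np.linalg.norm(clip_by_norm(small)[0]))  # ~0.316, untouched
```

Deep learning frameworks ship this as a one-liner, but the idea is just this rescaling: the gradient's direction is preserved while its magnitude is capped.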

But vanishing gradients is much harder to solve, and it will be the subject of the next few videos.

So to summarize: in an earlier course, you saw how in training a very deep neural network you can run into vanishing or exploding gradient problems, where the derivative either decreases exponentially or grows exponentially as a function of the number of layers. An RNN processing data over, say, 1,000 time steps or 10,000 time steps is basically a 1,000-layer or a 10,000-layer neural network, and so it too runs into these types of problems. Exploding gradients you can address by just using gradient clipping, but vanishing gradients will take more work to address.

So what we'll do in the next video is talk about the GRU, the Gated Recurrent Unit, which is a very effective solution for addressing the vanishing gradient problem and will allow your neural network to capture much longer-range dependencies. So, let's go on to the next video.
