In the last video, you saw the notation we use to define sequence learning problems. Now, let's talk about how you can build a model, build a neural network, to learn the mapping from x to y. One thing you could do is try to use a standard neural network for this task. So, in our previous example, we had nine input words. You could imagine taking these nine input words, maybe the nine one-hot vectors, and feeding them into a standard neural network, maybe with a few hidden layers, and eventually having it output the nine values, zero or one, that tell you whether each word is part of a person's name. But this turns out not to work well, and there are really two main problems with it.

The first is that the inputs and outputs can be different lengths in different examples. It's not as if every single example has the same input length Tx or the same output length Ty. Maybe if every sentence had a maximum length, you could pad, or zero-pad, every input up to that maximum length, but this still doesn't seem like a good representation. A second, and maybe more serious, problem is that a naive neural network architecture like this doesn't share features learned across different positions of text.

In particular, if the neural network has learned that maybe the word Harry appearing in position one gives a sign that it is part of a person's name, then wouldn't it be nice if it automatically figured out that Harry appearing in some other position, x<t>, might also mean part of a person's name? This is maybe similar to what you saw with convolutional neural networks, where you want things learned for one part of the image to generalize quickly to other parts of the image, and we'd like a similar effect for sequence data as well. And similar to what you saw with convnets, using a better representation will also let you reduce the number of parameters in your model.

So previously, we said that each of these inputs is a 10,000-dimensional one-hot vector, and so this is just a very large input layer. The total input size would be the maximum number of words times 10,000, and the weight matrix of this first layer would end up having an enormous number of parameters.
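To make that size concrete, here's a quick back-of-the-envelope calculation. The 10,000-word vocabulary and nine-word sentence come from the running example; the hidden-layer width of 1,000 units is just an assumed illustrative value, not something from the lecture:

```python
# Rough parameter count for the naive fully connected approach.
vocab_size = 10_000   # one-hot dimension per word (from the example)
max_words = 9         # sentence length in the example
hidden_units = 1_000  # assumed width of the first hidden layer

input_size = max_words * vocab_size              # 90,000 inputs
first_layer_params = input_size * hidden_units   # weights in layer 1 alone
print(input_size)          # 90000
print(first_layer_params)  # 90000000 -- 90 million weights in one layer
```

Even before counting deeper layers, one weight matrix already has tens of millions of parameters, which is the problem the shared-parameter RNN avoids.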

A recurrent neural network, which we'll start to describe on the next slide, does not have either of these disadvantages. So, what is a recurrent neural network? Let's build one up.

If you are reading the sentence from left to right, the first word you read is, say, x1. What we're going to do is take the first word and feed it into a neural network layer. I'm going to draw it like this. So that's the hidden layer of the first neural network, and we can have the neural network maybe try to predict the output: is this part of a person's name or not? What a recurrent neural network does is, when it then goes on to read the second word in the sentence, say x2, instead of just predicting y-hat 2 using only x2, it also gets to input some information from what it computed at time-step one. In particular, the activation value from time-step one is passed on to time-step two.

Then, at the next time-step, the recurrent neural network inputs the third word, x3, and tries to output some prediction, y-hat 3, and so on, up until the last time-step, where it inputs x<Tx> and outputs y-hat<Ty>. In this example, Tx = Ty, and the architecture will change a bit if Tx and Ty are not identical. So, at each time-step, the recurrent neural network passes on its activation to the next time-step for it to use.

To kick off the whole thing, we'll also have some made-up activation at time zero. This is usually the vector of all zeros. Some researchers will initialize a0 randomly, and there are other ways to initialize a0 as well, but having a fake time-zero activation that is just a vector of zeros is the most common choice, and that gets input into the neural network.

In some research papers or in some books, you'll see this type of neural network drawn with the following diagram, in which at every time-step you input x and output y-hat; maybe sometimes there will be a t index there. Then, to denote the recurrent connection, sometimes people will draw a loop like that, meaning the layer feeds back to itself, and sometimes they'll draw a shaded box, where the shaded box denotes a time delay of one step. I personally find these recurrent diagrams much harder to interpret, so throughout this course I'll tend to draw the unrolled diagram, like the one you have on the left. But if you see something like the diagram on the right in a textbook or a research paper, the way I tend to think about it is to mentally unroll it into the diagram you have on the left-hand side.

The recurrent neural network scans through the data from left to right, and the parameters it uses at each time-step are shared. So there will be a set of parameters, which we'll describe in greater detail on the next slide: the parameters governing the connection from x1 to the hidden layer will be some set of parameters we're going to write as Wax, and it's the same parameters Wax that are used at every time-step, so I guess you could write Wax there as well. The activations, the horizontal connections, will be governed by some set of parameters Waa, and the same parameters Waa are used at every time-step; similarly, there's some Wya that governs the output predictions. I'll describe on the next slide exactly how these parameters work.

So in this recurrent neural network, what this means is that when making the prediction for y3, it gets information not only from x3 but also from x1 and x2, because the information from x1 can pass through this way to help the prediction of y3. Now, one weakness of this RNN is that it only uses information that is earlier in the sequence to make a prediction. In particular, when predicting y3, it doesn't use information about the words x4, x5, x6, and so on.

This is a problem because if you're given the sentence, "He said, 'Teddy Roosevelt was a great president,'" then in order to decide whether or not the word Teddy is part of a person's name, it would be really useful to know not just information from the first two words but information from the later words in the sentence as well, because the sentence could also have been, "He said, 'Teddy bears are on sale!'" So, given just the first three words, it's not possible to know for sure whether the word Teddy is part of a person's name. In the first example, it is; in the second example, it is not; but you can't tell the difference if you look only at the first three words.

So one limitation of this particular neural network structure is that the prediction at a certain time uses information from the inputs earlier in the sequence but not information later in the sequence. We'll address this in a later video, where we talk about bidirectional recurrent neural networks, or BRNNs. For now, this simpler unidirectional neural network architecture will suffice for us to explain the key concepts, and we'll just have to make a quick modification to these ideas later to enable, say, the prediction of y-hat 3 to use both information earlier in the sequence and information later in the sequence; but we'll get to that in a later video.

So let's now write explicitly the calculations that this neural network does. Here's a cleaned-up version of the picture of the neural network. As I mentioned previously, you typically start off with the input a0 equal to the vector of all zeros. Next, this is what forward propagation looks like. To compute a1, you apply an activation function g to Waa times a0, plus Wax times x1, plus a bias I'm going to write as ba. Then, to compute y-hat 1, the prediction at time-step one, you apply some activation function, maybe a different activation function than the one above, to Wya times a1 plus by.

The notation convention I'm going to use for the subscripts of these matrices, taking Wax as the example, is this: the second index means that Wax is going to be multiplied by some x-like quantity, and the first index means that it is used to compute some a-like quantity. Similarly, you'll notice that Wya is multiplied by some a-like quantity to compute a y-type quantity.

The activation function used to compute the activations will often be a tanh; that's a common choice for an RNN, and sometimes ReLUs are also used, although tanh is actually the more common choice. We have other ways of preventing the vanishing gradient problem, which we'll talk about later this week. Then, depending on what your output y is: if it is a binary classification problem, you would use a sigmoid activation function, or it could be a softmax if you have a k-way classification problem. The choice of activation function here depends on what type of output y you have. So, for the named entity recognition task, where y is either zero or one, the second g could be a sigmoid activation function. You could write g2 if you want to emphasize that these could be different activation functions, but I usually won't do that.

Then, more generally, at time t, a<t> will be g of Waa times a<t-1>, the activation from the previous time-step, plus Wax times x<t> from the current time-step, plus ba; and y-hat<t> is equal to g, where again it could be a different activation function, of Wya times a<t> plus by. So these equations define forward propagation in the neural network, where you start off with a0 as the vector of zeros, and then, using a0 and x1, you compute a1 and y-hat 1; then you take x2 and use x2 and a1 to compute a2 and y-hat 2; and so on, carrying out forward propagation from the left to the right of this picture.
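As a sanity check on these equations, here is a minimal NumPy sketch of that forward pass. The dimensions (100 hidden units, a 10,000-word vocabulary, nine time-steps) follow the running example; the random parameter values and random one-hot words are just placeholders:

```python
import numpy as np

def rnn_forward(x_seq, Waa, Wax, Wya, ba, by):
    """Basic RNN forward propagation:
    a<t>     = tanh(Waa @ a<t-1> + Wax @ x<t> + ba)
    y-hat<t> = sigmoid(Wya @ a<t> + by)   # binary label per time-step
    """
    n_a = Waa.shape[0]
    a = np.zeros((n_a, 1))          # a<0>: the usual all-zeros initialization
    y_hats = []
    for x in x_seq:                 # scan left to right, reusing the same parameters
        a = np.tanh(Waa @ a + Wax @ x + ba)
        y = 1.0 / (1.0 + np.exp(-(Wya @ a + by)))
        y_hats.append(y)
    return y_hats

def one_hot(i, n):
    v = np.zeros((n, 1))
    v[i] = 1.0
    return v

# Toy dimensions matching the example: 100 hidden units, 10,000-word vocabulary.
rng = np.random.default_rng(0)
n_a, n_x, T = 100, 10_000, 9
Waa = rng.normal(0, 0.01, (n_a, n_a))
Wax = rng.normal(0, 0.01, (n_a, n_x))
Wya = rng.normal(0, 0.01, (1, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((1, 1))

x_seq = [one_hot(rng.integers(n_x), n_x) for _ in range(T)]  # nine one-hot "words"
y_hats = rnn_forward(x_seq, Waa, Wax, Wya, ba, by)
print(len(y_hats))   # 9 -- one prediction per word
```

Note how the same Waa, Wax, and Wya are applied at every time-step: that is the parameter sharing that the naive fully connected network lacked.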

Now, in order to help us develop more complex neural networks, I'm going to take this notation and simplify it a little bit. So let me copy these two equations to the next slide. Right, here they are. To simplify the notation a bit, I'm going to take that first equation and write it in a slightly simpler way: a<t> = g of just one matrix Wa times a new quantity, [a<t-1>, x<t>], plus ba. The quantities on the left and the right are supposed to be equivalent. The way we define Wa is to take the matrix Waa and the matrix Wax and put them side by side, stacking them horizontally, as follows. And this will be the matrix Wa.

So, for example, if a was 100-dimensional and, continuing our earlier example, x was 10,000-dimensional, then Waa would have been a 100 by 100 matrix and Wax would have been a 100 by 10,000 matrix. Stacking these two matrices together, this part would be 100 columns wide and this part would be 10,000 columns wide, so Wa will be a 100 by 10,100 matrix. I guess this diagram on the left is not drawn to scale, since Wax would be a very wide matrix. And what this [a<t-1>, x<t>] notation means is to just take the two vectors and stack them together.

So let me use that notation to denote that we're going to take the vector a<t-1>, which is 100-dimensional, and stack it on top of x<t>, so this ends up being a 10,100-dimensional vector. And hopefully you can check for yourself that this matrix times this vector just gives you back the original quantity. Right, because the matrix [Waa, Wax] multiplied by this [a<t-1>; x<t>] vector is just equal to Waa times a<t-1> plus Wax times x<t>, which is exactly what we had back over here.
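Here's a quick numerical check of that block-matrix equivalence, using small assumed dimensions (3 hidden units, 5 inputs) so it runs instantly:

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_x = 3, 5                      # small assumed sizes for the check
Waa = rng.normal(size=(n_a, n_a))
Wax = rng.normal(size=(n_a, n_x))
a_prev = rng.normal(size=(n_a, 1))
x = rng.normal(size=(n_x, 1))

Wa = np.hstack([Waa, Wax])           # stack the matrices side by side
stacked = np.vstack([a_prev, x])     # stack a<t-1> on top of x<t>

# Wa @ [a<t-1>; x<t>] should equal Waa @ a<t-1> + Wax @ x<t>
lhs = Wa @ stacked
rhs = Waa @ a_prev + Wax @ x
print(np.allclose(lhs, rhs))         # True
```

The two expressions agree exactly, which is why the compressed notation is safe to use.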

So, the advantage of this notation is that rather than carrying around two parameter matrices, Waa and Wax, we can compress them into just one parameter matrix, Wa, and this will simplify the notation as we develop more complex models. Then, in a similar way, I'm just going to rewrite the output equation slightly as y-hat<t> = g of Wy times a<t> plus by. Now we just have single subscripts in the notation, where Wy and by denote what type of output quantity we're computing: Wy indicates that this is a weight matrix for computing a y-like quantity, while the Wa and ba above are the parameters for computing the a, the activation, quantity. So, that's it. You now know what a basic recurrent neural network is. Next, let's talk about backpropagation and how you learn with these RNNs.

Â