
In the last video, you saw the notation we used to define sequence learning problems.

Now, let's talk about how you can build a model,

build a neural network to learn the mapping from X to Y.

Now, one thing you could do is try to use a standard neural network for this task.

So in our previous example,

we had nine input words.

So you could imagine trying to take these nine input words,

maybe the nine one-hot vectors, and feeding them into a standard neural network,

maybe a few hidden layers and then eventually,

have this output nine values, zero or one, that tell you whether each word is part of a person's name.

But this turns out not to work well,

and there are really two main problems with this.

The first is that the inputs and outputs can be different lengths in different examples.

So it's not as if every single example has

the same input length TX or the same output length TY.

And maybe if every sentence had a maximum length,

maybe you could pad,

or zero pad every input up to that maximum length,

but this still doesn't seem like a good representation.

And the second, and maybe more serious, problem is that a naive neural network architecture like this doesn't share features learned across different positions of text.

In particular, if the neural network has learned that maybe the word Harry appearing in position one gives a sign that that's part of a person's name, then wouldn't it be nice if it automatically figured out that Harry appearing in some other position, x<t>, also means that that might be part of a person's name.

And this is maybe similar to what you saw in convolutional neural networks where you

want things learned for one part of

the image to generalize quickly to other parts of the image,

and we'd like similar effect for sequence data as well.

And similar to what you saw with convnets, using a better representation will also let you reduce the number of parameters in your model.

So previously, we said that each of these is a 10,000-dimensional one-hot vector.

And so, this is just a very large input layer.

The total input size would be the maximum number of words times 10,000, and the weight matrix of this first layer would end up having an enormous number of parameters.
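To make that parameter blow-up concrete, here's a back-of-the-envelope calculation in Python. The 9 words and 10,000-word vocabulary come from the example; the 1,000-unit first hidden layer is a hypothetical size chosen just for illustration:

```python
# Hypothetical sizes: 9-word maximum length, 10,000-word vocabulary,
# and an assumed 1,000-unit first hidden layer.
max_words = 9
vocab_size = 10_000
hidden_units = 1_000

input_size = max_words * vocab_size              # stacked one-hot input: 90,000-dimensional
first_layer_params = input_size * hidden_units   # weights in the first layer alone

print(first_layer_params)  # 90000000
```

So even before counting biases or later layers, the very first weight matrix already has ninety million entries.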

So a recurrent neural network, which we'll start to describe on the next slide, does not have either of these disadvantages.

So what is a recurrent neural network?

Let's build one out.

So if you are reading the sentence from left to right,

the first word you read is some first word, say x<1>.

What we're going to do is take the first word and feed it into a neural network layer.

I'm going to draw it like this.

So that's a hidden layer of the first neural network.

And we can have the neural network try to predict the output.

So is this part of a person's name or not?

And what a recurrent neural network does is when it

then goes on to read the second word in a sentence,

say x<2>, instead of just predicting y<2> using only x<2>, it also gets to input some information from what it computed at time-step one.

So in particular, the activation value from time-step one is passed on to time-step 2.

And then, at the next time-step,

a recurrent neural network inputs the third word x<3> and tries to output some prediction y-hat<3>, and so on, up until the last time-step, where it inputs x<Tx> and then outputs y-hat<Ty>.

In this example, Tx=Ty,

and the architecture will change a bit if Tx and Ty are not identical.

And so, at each time-step,

the recurrent neural network passes on

this activation to the next time-step for it to use.

And to kick off the whole thing,

we'll also have some made-up activation at time zero. This is usually the vector of all zeros. Some researchers will initialize a<0> randomly, and there are other ways to initialize a<0>, but really, using the vector of zeros as the fake time-zero activation is the most common choice. And so that gets input into the neural network.

In some research papers or in some books,

you see this type of neural network drawn with

the following diagram in which every time-step,

you input x and output y-hat, and maybe sometimes there would be a <t> index there,

and then to denote the recurrent connection,

sometimes people will draw a loop like that,

that the layer feeds back to itself.

Sometimes they'll draw a shaded box, where the shaded box denotes a time delay of one step.

I personally find these recurrent diagrams much harder to interpret.

And so throughout this course,

I will tend to draw the unrolled diagram like the one you have on the left.

But if you see something like the diagram on

the right in a textbook or in a research paper,

what it really means, or the way I tend to think about it, is to mentally unroll it into the diagram you have on the left-hand side.

The recurrent neural network scans through the data from left to right.

And the parameters it uses for each time step are shared.

So there will be a set of parameters which

we'll describe in greater detail on the next slide,

but the parameters governing the connection from x<1> to the hidden layer will be some set of parameters we're going to write as Wax, and it's the same parameters Wax that it uses for every time-step, so I guess you could write Wax there as well.

And the activations, the horizontal connections, will be governed by some set of parameters Waa, and it's the same parameters Waa used on every time-step, and similarly, it's some Wya that governs the output predictions.

And I'll describe in the next slide exactly how these parameters work.

So in this recurrent neural network,

what this means is that when making the prediction for y<3>, it gets the information not only from x<3>, but also from x<1> and x<2>, because the information from x<1> can pass through this way to help the prediction for y<3>.

Now one weakness of this RNN is that it only uses

the information that is earlier in the sequence to make a prediction,

in particular, when predicting y<3>, it doesn't use information about the words x<4>, x<5>, x<6>, and so on.

And so this is a problem because if you're given a sentence,

he said, "Teddy Roosevelt was a great president."

In order to decide whether or not the word Teddy is part of a person's name, it would be really useful to know not just information from the first two words but information from the later words in the sentence as well, because the sentence could also have been,

he said, "Teddy bears are on sale!"

And so, given just the first three words,

it's not possible to know for sure whether the word Teddy is part of a person's name.

In the first example, it is; in the second example, it is not;

but you can't tell the difference if you look only at the first three words.

So one limitation of

this particular neural network structure is that the prediction at a certain time

uses inputs or uses information from the inputs

earlier in the sequence but not information later in the sequence.

We will address this in a later video where we talk about

bidirectional recurrent neural networks, or BRNNs.

But for now,

this simpler uni-directional neural network architecture

will suffice for us to explain the key concepts.

And we'll just have to make a quick modification to these ideas later to enable, say, the prediction of y-hat<3>

to use both information earlier in

the sequence as well as information later in the sequence,

but we'll get to that in a later video.

So let's now write out explicitly the calculations that this neural network does.

Here's a cleaned-up version of the picture of the neural network.

As I mentioned previously, typically you start off with the input a<0> equal to the vector of all zeros.

Next, here's what forward propagation looks like.

To compute a<1>, you would compute that as an activation function g applied to Waa times a<0> plus Wax times x<1> plus a bias, which I'm going to write as ba. And then to compute y-hat<1>, the prediction at time-step one, that will be some activation function, maybe a different activation function than the one above, applied to Wya times a<1> plus by.

And the notation convention I'm going to use for the subscripts of these matrices is, taking Wax as the example, the second index means that this Wax is going to be multiplied by some x-like quantity, and the first index means that it's used to compute some a-like quantity. Like so. And similarly,

you notice that here Wya is multiplied by some a-like quantity to compute a y-type quantity.

The activation function used to compute the activations will often be a tanh, in the choice of an RNN, and sometimes ReLUs are also used, although tanh is actually a pretty common choice.

And we have other ways of preventing

the vanishing gradient problem which we'll talk about later this week.

And depending on what your output y is,

if it is a binary classification problem,

then I guess you would use a sigmoid activation function, or it could be a softmax if you have a k-way classification problem.

But the choice of activation function here would

depend on what type of output y you have.

So, for the named entity recognition task, where y was either zero or one, I guess the second g could be a sigmoid activation function.

And I guess you could write g2 if you want to distinguish that these could be different activation functions, but I usually won't do that.

And then, more generally, at time t, a<t> will be g of Waa times a from the previous time-step, plus Wax times x from the current time-step, plus ba, and y-hat<t> is equal to g, again it could be a different activation function, of Wya times a<t> plus by.

So, these equations define forward propagation in the neural network.

You would start off with a<0>, the vector of all zeros, and then using a<0> and x<1>, you compute a<1> and y-hat<1>; then you take x<2> and use x<2> and a<1> to compute a<2> and y-hat<2>, and so on, and you carry out forward propagation going from the left to the right of this picture.
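As an illustration, here is a minimal numpy sketch of this forward pass. The sizes (100 hidden units, a 10,000-word vocabulary, 9 time-steps) and the use of tanh and sigmoid as the two activation functions follow the lecture; the random weights and random one-hot inputs are just placeholders, so this is a toy, not a trained model:

```python
import numpy as np

np.random.seed(0)

# Assumed sizes: 100 hidden units, 10,000-word vocabulary,
# scalar binary output, 9 time-steps.
n_a, n_x, n_y, T_x = 100, 10_000, 1, 9

# Shared parameters, reused at every time-step.
Waa = np.random.randn(n_a, n_a) * 0.01
Wax = np.random.randn(n_a, n_x) * 0.01
Wya = np.random.randn(n_y, n_a) * 0.01
ba = np.zeros((n_a, 1))
by = np.zeros((n_y, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_hot(i, n):
    # One-hot column vector with a 1 in position i.
    v = np.zeros((n, 1))
    v[i, 0] = 1.0
    return v

# Nine random one-hot input vectors standing in for the nine words.
xs = [one_hot(np.random.randint(n_x), n_x) for _ in range(T_x)]

a = np.zeros((n_a, 1))  # a<0>, the fake time-zero activation
y_hats = []
for x in xs:
    a = np.tanh(Waa @ a + Wax @ x + ba)   # a<t> = g(Waa a<t-1> + Wax x<t> + ba)
    y_hats.append(sigmoid(Wya @ a + by))  # y-hat<t> = g(Wya a<t> + by)

print(len(y_hats), y_hats[0].shape)  # 9 (1, 1)
```

Note that the same Waa, Wax, Wya, ba, by are used at every iteration of the loop, which is exactly the parameter sharing across time-steps described above.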

Now, in order to help us develop the more complex neural networks,

I'm actually going to take this notation and simplify it a little bit.

So, let me copy these two equations in the next slide.

Right, here they are. And to simplify the notation a bit, I'm actually going to take that and write it in a slightly simpler way. So I'm going to write this as a<t> = g of just a matrix Wa times a new quantity, which is going to be [a<t-1>, x<t>], and then plus ba.

And so, the underlined quantities on the left and right are supposed to be equivalent.

So, the way we define Wa is we'll take this matrix Waa and this matrix Wax and put them side by side, stacking them horizontally as follows. And this will be the matrix Wa.

So for example, if a was 100-dimensional and, continuing our earlier example, x was 10,000-dimensional, then Waa would have been a 100 by 100 dimensional matrix, and Wax would have been a 100 by 10,000 dimensional matrix.

And so, stacking these two matrices together, this would be 100 rows tall; this part would be 100 columns, and this part would be, I guess, 10,000 columns. So Wa will be a 100 by 10,100 dimensional matrix.

I guess this diagram on the left is not drawn to scale, since Wax would be a very wide matrix.

And what this notation means,

is to just take the two vectors,

and stack them together.

So, let me use that notation to denote that we're going to take the vector a<t-1>, which is 100-dimensional, and stack it on top of x<t>, which is 10,000-dimensional. So this ends up being a 10,100-dimensional vector.

And so hopefully, you can check for yourself that this matrix times this vector just gives you back the original quantity.

Right, because now this matrix [Waa, Wax] multiplied by this [a<t-1>, x<t>] vector is just equal to Waa times a<t-1> plus Wax times x<t>, which is exactly what we had back over here.

So, the advantage of this notation is that rather than carrying around two parameter matrices, Waa and Wax, we can compress them into just one parameter matrix Wa.

And this will simplify a notation for when we develop more complex models.
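Here's a quick numpy check of that equivalence, using the dimensions from the example (a is 100-dimensional, x is 10,000-dimensional); the weights and vectors are random, since only the shapes and the stacking matter here:

```python
import numpy as np

np.random.seed(1)
n_a, n_x = 100, 10_000  # dimensions from the example above

Waa = np.random.randn(n_a, n_a)
Wax = np.random.randn(n_a, n_x)
a_prev = np.random.randn(n_a, 1)   # stands in for a<t-1>
x_t = np.random.randn(n_x, 1)      # stands in for x<t>

# Stack the matrices side by side: Wa = [Waa | Wax], shape (100, 10100).
Wa = np.hstack([Waa, Wax])

# Stack the vectors on top of each other: [a<t-1> ; x<t>], shape (10100, 1).
ax = np.vstack([a_prev, x_t])

# The compressed product equals the sum of the two separate products.
assert Wa.shape == (100, 10_100)
assert np.allclose(Wa @ ax, Waa @ a_prev + Wax @ x_t)
```

This is just block matrix multiplication: the first 100 columns of Wa hit a<t-1>, and the remaining 10,000 columns hit x<t>.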

And then, in a similar way, I'm just going to rewrite this slightly as y-hat<t> = g of Wy times a<t> plus by. And now we just have one subscript in the notation; Wy and by denote what type of output quantity we're computing.

So Wy indicates that it's a weight matrix for computing a y-like quantity, and the Wa and ba up top indicate that those are the parameters for computing an a-like, activation quantity. So, that's it.

You now know what a basic recurrent neural network is.

Next, let's talk about back propagation and how you learn with these RNNs.