0:00

In the last video, you saw the notation we'll use to define sequence learning problems.

Now, let's talk about how you can build a model,

build a neural network, to learn the mapping from x to y.

Now, one thing you could do is try to use a standard neural network for this task.

So, in our previous example,

we had nine input words.

So, you could imagine trying to take these nine input words,

maybe the nine one-hot vectors and feeding them into a standard neural network,

maybe a few hidden layers,

and then eventually have it output the nine values, zero or

one, that tell you whether each word is part of a person's name.

But this turns out not to work well, and there are really two main problems with this. The

first is that the inputs and outputs can be different lengths in different examples.

So, it's not as if every single example has the same input length Tx or

the same output length Ty. And even if every sentence has some maximum length,

so that you could pad or zero-pad every input up to

that maximum length, this still doesn't seem like a good representation.

And then a second and maybe more serious problem is

that a naive neural network architecture like this,

it doesn't share features learned across different positions of text.

In particular, if the neural network has learned that maybe the word

Harry appearing in position one gives a sign that that's part of a person's name,

then wouldn't it be nice if it automatically figures out that Harry appearing in

some other position x_t also means that that might be a person's name.

And this is maybe similar to what you saw in convolutional neural networks where

you want things learned for one part of

the image to generalize quickly to other parts of the image,

and we'd like similar effects for sequence data as well.

And similar to what you saw with convnets, using

a better representation will also let you reduce the number of parameters in your model.

So previously, we said that each of these is a 10,000-dimensional

one-hot vector, and so this is

just a very large input layer if

the total input size is the maximum number of words times 10,000.

The weight matrix of this first layer would end up having an enormous number of parameters.
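To make the parameter-count concern concrete, here's a rough back-of-the-envelope sketch. The vocabulary size and nine-word input follow the lecture's running example; the 1,000-unit hidden layer is a hypothetical choice, not something from the lecture.

```python
# Rough parameter count for the naive fully connected approach.
# vocab_size and max_len follow the lecture's running example;
# hidden_units is an illustrative assumption.
vocab_size = 10_000
max_len = 9
hidden_units = 1_000

input_size = max_len * vocab_size                # 90,000-dimensional padded input
first_layer_weights = input_size * hidden_units  # weights only, ignoring biases
print(first_layer_weights)  # 90,000,000 weights in the first layer alone
```

Even with a modest hidden layer, the first weight matrix alone has tens of millions of parameters, which is the disadvantage the recurrent architecture avoids by sharing parameters across positions.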

So a recurrent neural network, which we'll start to describe on

the next slide, does not have either of these disadvantages.

So, what is a recurrent neural network?

Let's build one up.

So if you are reading the sentence from left to right,

the first word you read is some first word, say x1,

and what we're going to do is take the first word

and feed it into a neural network layer.

I'm going to draw it like this.

So there's a hidden layer of the first neural network and we

can have the neural network maybe try to predict the output.

So: is this part of a person's name or not?

And what a recurrent neural network does is,

when it then goes on to read the second word in the sentence,

say x2, instead of just predicting y-hat 2 using only x2,

it also gets to use some information from what it computed at time step one.

So in particular, the activation value from time step one is passed on to time step two.

Then at the next time step,

the recurrent neural network inputs

the third word x3 and tries to output some prediction,

y-hat 3, and so on, up until

the last time step, where it inputs x_Tx and then outputs y-hat_Ty.

At least in this example,

Tx is equal to Ty, and the architecture will change a bit if Tx and Ty are not identical.

So at each time step,

the recurrent neural network passes on

its activation to the next time step for it to use.

And to kick off the whole thing,

we'll also have some made-up activation at time zero,

and this is usually the vector of zeros.

Some researchers will initialize a_0 randomly.

There are other ways to initialize a_0, but really, having a vector of

zeros as the fake time-zero activation is the most common choice.

So that gets input to the neural network.

In some research papers or in some books,

you see this type of neural network drawn with the following diagram in which at

every time step you input x and output y_hat.

Maybe sometimes there will be

a t index there, and then to denote the recurrent connection,

sometimes people will draw a loop like that,

where the layer feeds back to itself.

Sometimes, they'll draw a shaded box, where the shaded box

denotes a time delay of one step.

I personally find these recurrent diagrams much

harder to interpret and so throughout this course,

I'll tend to draw the unrolled diagram like the one you have on the left,

but if you see something like the diagram on

the right in a textbook or in a research paper,

what it really means, or the way I tend to think about it, is to mentally

unroll it into the diagram you have on the left instead.

The recurrent neural network scans through the data from left to right.

The parameters it uses for each time step are shared.

So there'll be a set of parameters which

we'll describe in greater detail on the next slide,

but the parameters governing the connection from x1 to the hidden layer

will be some set of parameters we're going to write as Wax, and it's

the same parameters Wax that it uses for every time step.

I guess I could write Wax there as well.

The activations, the horizontal connections, will be governed by

some set of parameters Waa, and

the same parameters Waa are used on every time step, and

similarly there's some Wya that governs the output predictions.

I'll describe on the next slide exactly how these parameters work.

So, in this recurrent neural network,

what this means is that when making the prediction for y-hat 3,

it gets information not only from x3 but also from x1 and x2, because

the information from x1 can pass through this way to help the prediction for y-hat 3.

Now, one weakness of this RNN is that it only uses

the information that is earlier in the sequence to make a prediction.

In particular, when predicting y3,

it doesn't use information about the words x4,

x5, x6, and so on.

So this is a problem because if you are given a sentence,

"He said Teddy Roosevelt was a great president."

In order to decide whether or not the word Teddy is part of a person's name,

it would be really useful to know not just information from the first two words but to

know information from the later words in the sentence

as well because the sentence could also have been,

"He said teddy bears are on sale."

So given just the first three words, it is not possible to know

for sure whether the word Teddy is part of a person's name.

In the first example, it is.

In the second example, it is not.

But you can't tell the difference if you look only at the first three words.

So one limitation of

this particular neural network structure is that the prediction at a certain time

uses inputs or uses information from the inputs

earlier in the sequence but not information later in the sequence.

We will address this in a later video where we talk about

bi-directional recurrent neural networks or BRNNs.

But for now,

this simpler unidirectional neural network architecture

will suffice to explain the key concepts.

We'll just have to make a quick modification to these ideas later to enable, say,

the prediction of y_hat_three to use both information

earlier in the sequence as well as information later in the sequence.

We'll get to that in a later video.

So, let's now write explicitly what are the calculations that this neural network does.

Here's a cleaned up version of the picture of the neural network.

As I mentioned previously, you'd typically

start off with the input a_0 equal to the vector of all zeros.

Next, this is what forward propagation looks like.

To compute a_1, you would compute that as an activation function g applied to

Waa times a_0 plus

Wax times x_1 plus a bias,

which I'm going to write as b_a.

Then to compute y-hat 1,

the prediction at time step one,

that will be some activation function,

maybe a different activation function than the one above,

applied to Wya times a_1 plus b_y.

The notation convention I'm going to use for the subscripts of

these matrices is, as in that example, Wax:

the second index means that this Wax is going to be multiplied by some x-like quantity,

and the a means that it is used to compute some a-like quantity.

Similarly, you notice that here,

Wya is multiplied by some a-like quantity to compute a y-like quantity.

The activation function used to compute

the activations will often be a tanh; that's a common choice for an RNN,

and sometimes ReLUs are also used, although tanh is actually

a pretty common choice, and we

have other ways of preventing the vanishing gradient problem,

which we'll talk about later this week.

Depending on what your output y is,

if it is a binary classification problem,

then I guess you would use a sigmoid activation function,

or it could be a softmax if you have a k-way classification problem. So

the choice of activation function here will depend on what type of output y you have.

So, for the named entity recognition task, where y is either 0 or 1,

I guess this second g could be a sigmoid activation function.

Then I guess you could write g2 if you want to distinguish that

this could be different activation functions but I usually won't do that.

Then more generally, at time t,

a_t will be g of Waa times a_{t-1},

the activation from the previous time step, plus Wax times x_t from the current time step, plus b_a.

And y-hat t is equal to g, again

possibly a different activation function, of Wya times a_t plus b_y.
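These two equations for a single time step can be sketched in NumPy. The dimensions follow the running example (a 100-unit hidden state, a 10,000-word vocabulary), with tanh for the hidden activation and a sigmoid output as in named entity recognition; the random initialization and the word index are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_a, n_x = 100, 10_000                      # hidden-state and input dimensions
rng = np.random.default_rng(0)              # illustrative random parameters
Waa = rng.standard_normal((n_a, n_a)) * 0.01
Wax = rng.standard_normal((n_a, n_x)) * 0.01
Wya = rng.standard_normal((1, n_a)) * 0.01
ba = np.zeros((n_a, 1))
by = np.zeros((1, 1))

a_prev = np.zeros((n_a, 1))                 # a_0, the all-zeros initial activation
x_t = np.zeros((n_x, 1))
x_t[42] = 1.0                               # one-hot vector for some word (index is arbitrary)

a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # a_t = g(Waa a_{t-1} + Wax x_t + b_a)
y_hat_t = sigmoid(Wya @ a_t + by)              # y-hat t = g(Wya a_t + b_y)
```

The sigmoid squashes the output to (0, 1), so y_hat_t can be read as the probability that this word is part of a person's name.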

12:03

So, these equations define forward propagation in

a neural network, where you would start off with a_0 as the vector of all zeros,

and then using a0 and x1,

you will compute a1 and y hat one,

and then you take x2 and use x2 and a1 to compute a2 and y hat two,

and so on, and you'd carry out

forward propagation going from the left to the right of this picture.
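That left-to-right scan can be sketched as a simple loop. The dimensions here are made deliberately small, and the random parameters and stand-in inputs are illustrative only; this shows the recurrence, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_a, n_x, T_x = 4, 6, 3                     # tiny illustrative dimensions
Waa = rng.standard_normal((n_a, n_a)) * 0.1
Wax = rng.standard_normal((n_a, n_x)) * 0.1
Wya = rng.standard_normal((1, n_a)) * 0.1
ba, by = np.zeros((n_a, 1)), np.zeros((1, 1))

xs = [rng.standard_normal((n_x, 1)) for _ in range(T_x)]  # stand-in inputs x_1..x_Tx

a = np.zeros((n_a, 1))                      # a_0 = vector of all zeros
y_hats = []
for x_t in xs:                              # scan left to right
    a = np.tanh(Waa @ a + Wax @ x_t + ba)   # use a_{t-1} and x_t to get a_t
    y_hats.append(sigmoid(Wya @ a + by))    # then y-hat t from a_t
```

Note that the same Waa, Wax, and Wya are reused on every iteration, which is the parameter sharing described above.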

Now, in order to help us develop the more complex neural networks,

I'm actually going to take this notation and simplify it a little bit.

So, let me copy these two equations to the next slide. Right, here they are.

What I'm going to do,

to simplify the notation a bit,

is actually take that and write it in a slightly simpler way.

So, I'm going to write this as a_t equals g of just one matrix

Wa times a new quantity, which is going to be [a_{t-1}, x_t],

and then plus b_a,

and the underlined quantities on the left and right are supposed to be equivalent.

So the way we define Wa is we'll take this matrix Waa,

and this matrix Wax,

and put them side by side,

stack them horizontally as follows,

and this will be the matrix Wa.

So for example, if a is 100-dimensional,

and in our running example x is 10,000-dimensional,

then Waa would have been a 100 by 100 dimensional matrix,

and Wax would have been a 100 by 10,000 dimensional matrix.

So when we stack these two matrices together,

the result has 100 rows,

with 100 columns coming from Waa and 10,000 columns coming from Wax.

So, Wa will be a 100 by 10,100 dimensional matrix.

I guess this diagram on the left is not drawn to scale,

since Wax would be a very wide matrix.

What this notation means,

is to just take the two vectors and stack them together.

So when we use that notation, it means

we're going to take the vector a_{t-1},

which is 100-dimensional, and stack it on top of x_t,

so this ends up being a 10,100-dimensional vector.

So hopefully, you can check for yourself that this matrix

times this vector just gives you back the original quantity.

Right. Because now, this matrix [Waa, Wax]

multiplied by this [a_{t-1}, x_t] vector

is just equal to Waa times a_{t-1} plus Wax times x_t,

which is exactly what we had back over here.
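You can check this equivalence numerically. The following sketch uses small illustrative dimensions in place of 100 and 10,000, with random matrices standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
n_a, n_x = 5, 8                          # small stand-ins for 100 and 10,000
Waa = rng.standard_normal((n_a, n_a))
Wax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal((n_a, 1))   # a_{t-1}
x_t = rng.standard_normal((n_x, 1))

Wa = np.hstack([Waa, Wax])               # stack matrices side by side: n_a x (n_a + n_x)
stacked = np.vstack([a_prev, x_t])       # stack a_{t-1} on top of x_t: (n_a + n_x) x 1

# [Waa | Wax] @ [a_{t-1}; x_t] equals Waa @ a_{t-1} + Wax @ x_t
assert np.allclose(Wa @ stacked, Waa @ a_prev + Wax @ x_t)
```

This is just block matrix multiplication: the first n_a columns of Wa multiply a_{t-1}, and the remaining n_x columns multiply x_t.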

So, the advantage of this notation is that rather than carrying

around two parameter matrices, Waa and Wax,

we can compress them into just one parameter matrix Wa,

and this just simplifies our notation for when we develop more complex models.

Then, in a similar way,

I'm just going to rewrite this slightly.

I'm going to write this as Wy times a_t plus b_y,

and now we just have single subscripts in the notation: Wy and b_y

denote what type of output quantity we're computing.

So, Wy indicates a weight matrix for computing a y-like quantity,

and the Wa and b_a up on top indicate that those are

parameters for computing an a-like quantity, an activation.

So, that's it. You now know what a basic recurrent neural network is.

Next, let's talk about back propagation and how you will learn with these RNNs.