0:02

In this video, we will speak about simple Recurrent Neural Network,

how to use Backpropagation to train.

In the previous video, we spoke about a recurrent architecture.

In this architecture, we work with sequences, element by element.

Here, Xs are the elements of the input sequence and at each time step,

we have some output or a prediction right ahead.

We can use this architecture for a lot of different tasks.

For example, if we want to use it for language modeling,

then Xs should be word embeddings and the output

of our model at each time step is a probability distribution over a vocabulary,

which says which word is more probable to be the next one in our sequence.

Of course, we can generate this word from this probability distribution and give

it to our model as an input in the next time step so we can generate text.

We can use a recurrent architecture to solve a post tagging task.

In this task, we want to tag each word to the text,

which is a part of speech label.

Here we have the same inputs, word embeddings,

but the output is different now.

Now, at each time step,

the output is the probability distribution over the set of

all possible post tags which says which post tag is more probable to the current word.

Now we see that we can use this architecture for different tasks and here,

in the middle of this architecture,

we see this box MLP,

which can be quite different inside.

Let's use the symbols MLP, which exist.

The MLP is one layer of hidden neurons.

Then, we have a vector of hidden neurons in the front layer of our network

and we can use these hidden neurons both

to calculate the output of our networks or the predictions we had,

and to calculate the hidden neurons from the next time step.

On the slide, you can see formulas for forward pass in our network,

an action that follows are pretty similar to the ones you already saw in the course.

Here, again, you have some known neural functions

F often in your combinations of the inputs.

But here you need to remember that all the parameters are shared between

time steps so we use same word matrices and the same bisectors at each time stamp.

This type of visualization of recurrent neural network usually called a full diversion.

And there is another diversion of the rotation for recurrent neural network,

the more complex one.

Actually, this complex visualization is more general

because it doesn't depend on the length of the input sequence.

While the full diversion is actually depends on it,

but in the computerization,

we have these recurrent connection,

which forms a loop in our scheme and it's actually

not clear how to train a network with a loop.

That's why we will use a full visualization when we will speak about training.

Now, let's speak about training of Recurrent Neural Network.

Here we see a Recurrent Neural Network in the unfolded form.

To train it training we actually need some data.

Let's say that we have a training dataset so for some training sequence we know

all the true label's y and we make

some predictions y-hat and to understand how work are we,

use some loss function L but we use this loss function separately at each time step,

and then we sum up all of the sources to take their total loss, to have a total loss.

Now, we have our usual neural network,

we have data, we have a loss, what should we do?

We need to backpropagate.

As usual, in backpropagation, of the system,

we need to make a forward pass and backward pass.

In forward pass, we need to calculate all of the elements in

our own neural network so we calculate the hidden elements,

then we make our predictions, and we calculate the loss.

In the backward pass,

we need to calculate the gradient of our loss function with respect to

all the parameters so respect to weight matrices and bisvectors.

In usual fit for our neural network,

we make a backpropagation on different layers so in the vertical direction,

but here we have sequences.

This why we need to backpropagate not only through layers but

also through time in result direction.

That's why the algorithm of training of recurrent neural networks,

usually called backpropagation through time.

Additionally, we need to remember that all the parameters are shared between time steps.

Therefore, if you want to calculate the gradient

of our loss with respect to some parameter,

we actually need to sum up the gradient from all the time steps.

Let's, for example, calculate the gradient of

loss function with respect to weight matrix U.

As I already said, we need to sum up the gradients from all these time steps in

our sequence and to calculate the gradient of the loss at the specific time step t,

we can simply use a chain rule.

As a result, we have this product and the first element in this product,

we calculate if we know which loss function we used,

and the second element,

we can actually also calculate very simply

because our prediction depends on the matrix U only in one point.

Now let's consider a more difficult example.

Let's calculate the gradient of our loss function with respect to

the recurrent weight matrix W.

In first step, we will do the same thing order.

We simply sum up the gradients from all the time steps in

our sequence and actually we want to do the same thing thing order in the next step,

but if we look at the formula for the hidden units,

we see that there is not only one dependence

between our weight matrix W and hidden units at

time step t because hidden units in the previous time step actually also depend on W.

So, here, we need to roll deeper and use not a simple chain rule,

but use a formula of total derivative.

After finding this formula,

we end up with the following equation for gradient for

our loss function with respect to W.

At the last part of this equation,

you can see that the first we take into

account dependence between hidden units at time steps t and the

weight matrix W.

When we take one step back and take into account the dependence

between previous hidden units and weight matrix W,

and then we take one step back and so on and so on while we

don't reach the beginning power sequence.

As a result, here,

the last part of this equation is the sum of

the contributions from the old previous time steps toward

the gradient at time step t.

And to calculate the contribution from time step k to the gradient at time step t,

we actually need to go from the hidden units at time step t

to hidden units at time step k. In each element on the sum,

we have this product of the combined matrices of the gradients of hidden units

at one time step with respect to the hidden units in the previous time step.

Now, we calculated the gradient of our loss function

with respect to two weight matrices, U and W.

When we calculated the gradient of our loss function with respect to U,

we needed only to go backwards in layers,

so in vertical direction.

When we calculated the gradient with respect to W, we need to go

backwards both in vertical direction and in horizontal direction,

so backwards in layers and time.

Now we have the last weight matrix V,

and it's a question for you.

Do we really to go backwards in time to calculate the gradient of

our loss function with respect to weight matrix V?

Yes, here we have the same situation as with

the recurrent weight matrix W because the dependence

between hidden units and weight matrix V is actually not only in one place.

Hidden units from all the previous time steps also depends

from V so we need to go backwards in time to calculate this gradient.

Let's summarize what you've learned in this video.

We now know what is the simple Recurrent Neural Network is

and how to train it using Backpropagation through time.

This backpropagation through time algorithm is actually a simple backpropagation,

but with a fancy name.

In the next video, we will discuss

the difficulties that arise during the training of Recurrent Neural Networks.