You've already learned about the basic structure of an RNN.

In this video, you'll see how backpropagation in a recurrent neural network works.

As usual, when you implement this in one of the programming frameworks, often,

the programming framework will automatically take care of backpropagation.

But I think it's still useful to have a rough sense of how backprop works in RNNs.

Let's take a look.

You've seen how, for forward prop,

you would compute these activations from left to right in the neural network,

and so output all of the predictions.

In backprop, as you might already have guessed,

you end up carrying out

backpropagation calculations in basically the opposite direction

of the forward prop arrows.

So, let's go through the forward propagation calculation.

You're given this input sequence x_1, x_2,

x_3, up to x_{T_x}.

And then using x_1 and, say, a_0,

you're going to compute the activation at timestep one, a_1,

and then x_2 together with a_1 are used to compute a_2,

and then a_3, and so on, up to a_{T_x}.

All right. And then to actually compute a_1,

you also need the parameters.

We'll just draw this in green, W_a and b_a,

those are the parameters that are used to compute a_1.

And then, these parameters are actually used for every single timestep so,

these parameters are actually used to compute a_2, a_3,

and so on; all the activations up to the last timestep depend on the parameters W_a and b_a.
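As a rough sketch of that parameter sharing (the function and variable names here are illustrative, not from the lecture; a common convention, assumed here, is to apply W_a to the concatenation of a_{t-1} and x_t with a tanh activation):

```python
import numpy as np

def rnn_forward(xs, a0, Wa, ba):
    """Compute a_1, ..., a_{T_x} left to right.
    The same Wa and ba are reused at every single timestep."""
    a_t, activations = a0, []
    for x_t in xs:  # t = 1 .. T_x
        a_t = np.tanh(Wa @ np.concatenate([a_t, x_t]) + ba)
        activations.append(a_t)
    return activations
```

Note that the loop body never swaps in different parameters; that single (Wa, ba) pair is what every activation in the sequence depends on.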

Let's keep fleshing out this graph.

Now, given a_1, your neural network can then compute the first prediction, y-hat_1,

and then at the second timestep, y-hat_2, then y-hat_3,

and so on, up to y-hat_{T_y}.

And let me again draw the parameters of a different color.

So, to compute y-hat,

you need the parameters,

W_y as well as b_y,

and this goes into this node as well as all the others.

So, I'll draw this in green as well.
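The prediction step, again reusing the same W_y and b_y at every timestep, might be sketched as follows (a sigmoid output is an assumption here, matching the binary name-detection example discussed next):

```python
import numpy as np

def predict(activations, Wy, by):
    """Compute y-hat_t from each a_t; the same Wy, by serve every timestep."""
    return [1.0 / (1.0 + np.exp(-(Wy @ a_t + by))) for a_t in activations]
```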

Next, in order to compute backpropagation,

you need a loss function.

So let's define an element-wise loss first.

Suppose that, for a certain word in the sequence,

it is a person's name,

so y_t is one.

And your neural network outputs some probability of maybe

0.1 of the particular word being a person's name.

So I'm going to define this as the standard logistic regression loss,

also called the cross entropy loss.

This may look familiar to you from when we were previously

looking at binary classification problems.

So this is the loss associated with

a single prediction at a single position, or a single timestep,

t, for a single word.

Let's now define the overall loss of the entire sequence,

so L will be defined as the sum over all t equals one to T_x, or T_y.

T_x is equal to T_y in this example, and we sum the losses

for the individual timesteps, L^t(y-hat_t, y_t).

And so, this is just L without the

superscript t. This is the loss for the entire sequence.
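In code, the per-timestep loss and the overall loss could be sketched like this (a minimal sketch assuming binary labels and sigmoid outputs, as in this name-detection example; function names are illustrative):

```python
import numpy as np

def cross_entropy(yhat_t, y_t):
    """Element-wise loss for one timestep: the standard logistic regression
    (cross-entropy) loss."""
    return -(y_t * np.log(yhat_t) + (1 - y_t) * np.log(1 - yhat_t))

def sequence_loss(yhats, ys):
    """Overall loss L: the sum of the per-timestep losses, t = 1 .. T_y."""
    return sum(cross_entropy(yhat_t, y_t) for yhat_t, y_t in zip(yhats, ys))
```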

So, in a computation graph,

given y-hat_1,

you can then compute the loss for

the first timestep; and likewise, the loss for the second timestep,

the loss for the third timestep,

and so on, the loss for the final timestep.

And then lastly, to compute the overall loss,

we will take these and sum them all up to compute the final L using that equation,

which is the sum of the individual per timestep losses.

So, this is the computation graph,

and from the earlier examples you've seen of backpropagation,

it shouldn't surprise you that backprop then just requires

doing computations, or passing messages, in the opposite direction

So, you go against all of the forward propagation arrows,

and that's what you end up doing.

And that then allows you to compute all the appropriate quantities that let you then

take the derivatives with respect to the parameters,

and update the parameters using gradient descent.

Now, in this backpropagation procedure,

the most significant message or the most significant recursive calculation is this one,

which goes from right to left,

and that's why this algorithm is also given

a pretty fancy full name: backpropagation through time.

And the motivation for this name is that for forward prop,

you are scanning from left to right,

increasing indices of time, t, whereas for

backpropagation, you're going from right to left;

you're kind of going backwards in time.

So this gives the algorithm, I think, a really cool name,

backpropagation through time, where you're going backwards in time, right?

That phrase really makes it sound like you need a time machine to implement this algorithm,

but I just thought that backprop through time is

just one of the coolest names for an algorithm.
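To make that right-to-left recursion concrete, here is a minimal sketch of backpropagation through time for a simplified single-layer RNN with a sigmoid output at every timestep (all names and conventions here are illustrative assumptions, not the lecture's exact notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bptt_sketch(xs, ys, a0, Wa, ba, Wy, by):
    """Forward pass left to right, then backprop through time right to left.
    Returns gradients of the summed cross-entropy loss w.r.t. Wa and ba."""
    T, n_a = len(xs), a0.shape[0]
    a, yhat = {-1: a0}, {}
    for t in range(T):  # forward prop: t = 0 .. T-1
        a[t] = np.tanh(Wa @ np.concatenate([a[t - 1], xs[t]]) + ba)
        yhat[t] = sigmoid(Wy @ a[t] + by)
    dWa, dba = np.zeros_like(Wa), np.zeros_like(ba)
    da_next = np.zeros(n_a)  # message passed from timestep t+1 back to t
    for t in reversed(range(T)):  # backprop through time: right to left
        dy = yhat[t] - ys[t]                # grad of cross-entropy + sigmoid
        da = Wy.T @ dy + da_next            # loss at t, plus message from the future
        dz = (1.0 - a[t] ** 2) * da         # back through the tanh nonlinearity
        dWa += np.outer(dz, np.concatenate([a[t - 1], xs[t]]))
        dba += dz
        da_next = Wa[:, :n_a].T @ dz        # message to pass further back in time
    return dWa, dba
```

The key line is the update of `da_next`: that is the recursive right-to-left calculation the name refers to, carrying gradient information backwards in time from each timestep to the one before it.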

So, I hope that gives you a sense of how forward prop and backprop in an RNN work.

Now, so far, you've only seen this one main motivating example for RNNs,

in which the length of the input sequence was equal to the length of the output sequence.

In the next video,

I want to show you a much wider range of RNN architectures,

which will let you tackle a much wider set of applications. Let's go on to the next video.