In the last video, you saw how the attention model allows

a neural network to pay attention to

only part of an input sentence while it's generating a translation,

much like a human translator might.

Let's now formalize that intuition into

the exact details of how you would implement an attention model.

So same as in the previous video,

let's assume you have an input sentence and you use a bidirectional RNN,

or bidirectional GRU, or bidirectional LSTM to compute features on every word.

In practice, GRUs and LSTMs are often used for this, with LSTMs perhaps being more common.

And so for the forward recurrence, you have a forward-recurrence activation at the first time step, a backward-recurrence activation at the first time step, a forward-recurrence activation at the second time step, a backward-recurrence activation, and so on for all of them, up through the forward and backward activations at the fifth time step. We had an a<0> of all zeros here; technically, we can also have a backward a<6> that is a vector of all zeros.

And then, to simplify the notation going forward: at every time step, even though you have features computed from both the forward recurrence and the backward recurrence in the bidirectional RNN, I'm just going to use a<t'> to represent both of these concatenated together. So a<t'> is going to be the feature vector for time step t'. To be consistent with the notation we're using, I'm going to use t' to index into the words in the French sentence.
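As a rough sketch of that concatenation step (the array names and shapes here are my own, not from the video), the per-word features a<t'> can be built from the forward and backward activations like this:

```python
import numpy as np

def concat_bidirectional_features(a_forward, a_backward):
    """Build the per-word feature vectors a<t'> by concatenating the
    forward and backward activations of a bidirectional RNN.

    a_forward, a_backward: arrays of shape (Tx, n_a), hypothetical
    activations from the forward and backward recurrences.
    Returns an array of shape (Tx, 2 * n_a), one row per t'.
    """
    return np.concatenate([a_forward, a_backward], axis=-1)
```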

Next, we have our forward-only, single-direction RNN with state s to generate the translation. So at the first time step, it should generate y<1>, and it will have as input some context C. If you want to index it by time, you could write C<1>, but sometimes I just write C without the superscript one.

And this will depend on the attention parameters alpha<1,1>, alpha<1,2>, and so on, which tell us how much attention to pay. So these alpha parameters tell us how much the context depends on the features, or the activations, we're getting from the different time steps. And so the way we define the context is actually as a weighted sum of the features from the different time steps, weighted by these attention weights. More formally, the attention weights will satisfy the following: they will all be non-negative, so each will be zero or positive, and they'll sum to one.

We'll see later how to make sure this is true.

And we will have the context at time one (I'll often drop that superscript) be the sum, over all values of t', of the attention weights alpha<1, t'> times the activations a<t'>. So this term here is the attention weight, and this term here comes from the bidirectional RNN's activations. In general, alpha<t, t'> is the amount of attention that y<t> should pay to a<t'>. In other words, when you're generating the t-th output word, this is how much attention you should be paying to the t'-th input word.
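As a minimal numeric sketch of that definition (the names here are my own), the context for one output step is just the attention-weighted sum of the input features:

```python
import numpy as np

def context_vector(alphas, a):
    """Compute c<t> = sum over t' of alpha<t, t'> * a<t'>.

    alphas: shape (Tx,), attention weights for one output step
            (non-negative, summing to one).
    a:      shape (Tx, n_features), the per-word features a<t'>.
    Returns the context vector, shape (n_features,).
    """
    return alphas @ a  # the matrix product is exactly the weighted sum
```

For example, with weights like [0.9, 0.1], the context lies much closer to the first word's features than to the second's.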

So that's one step of generating the output. At the next time step, you generate the second output, again using a weighted sum, where now you have a new set of attention weights that define a new weighted sum. That generates a new context, which is also an input, and that allows you to generate the second word. Only now, this weighted sum becomes the context of the second time step: the sum over t' of alpha<2, t'> times a<t'>.

So using these context vectors, C<1>, C<2>, and so on, this network up here looks like a pretty standard RNN sequence with the context vectors as input, and we can just generate the translation one word at a time.

We have also defined how to compute the context vectors in terms of these attention weights and the features of the input sentence. So the only remaining thing to do is to define how to actually compute these attention weights. Let's do that on the next slide.

So just to recap, alpha<t, t'> is the amount of attention you should pay to a<t'> when you're trying to generate the t-th word in the output translation.

So let me just write down the formula, and then we'll talk about how it works. This is a formula you could use to compute alpha<t, t'>: first compute the terms e<t, t'>, and then use essentially a softmax to make sure the weights sum to one when you sum over t'. Concretely, alpha<t, t'> = exp(e<t, t'>) divided by the sum over t' from 1 to Tx of exp(e<t, t'>). So for every fixed value of t, these things sum to one if you're summing over t', and using this softmax normalization just ensures that this property holds.
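A minimal sketch of that softmax step, assuming the scores e<t, t'> for one fixed t are already in hand:

```python
import numpy as np

def attention_weights(e):
    """Softmax over the scores e<t, t'> for one fixed output step t,
    so the resulting alpha<t, t'> are non-negative and sum to one.

    e: shape (Tx,), unnormalized scores for this output step.
    """
    exp_e = np.exp(e - np.max(e))  # subtracting the max avoids overflow
    return exp_e / np.sum(exp_e)
```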

Now, how do we compute these factors e? Well, one way to do so is to use a small neural network, as follows. So s<t-1> is the neural network's hidden state from the previous time step. Here is the network we have: if you're trying to generate y<t>, then s<t-1> is the hidden state from the previous step, which feeds into s<t>, and that's one input to a very small neural network, usually with one hidden layer, because you need to compute these a lot. And then a<t'>, the features from time step t', is the other input.

And the intuition is, if you want to decide how much attention to pay to the activation at t', the quantities it seems like this should depend on most are your own hidden-state activation from the previous time step (you don't have the current state activation yet, because the context feeds into it, so you haven't computed that) and, for each of the positions, each of the words, their features. So it seems pretty natural that alpha<t, t'> and e<t, t'> should depend on these two quantities.

But we don't know what the function is. So one thing you could do is just train a very small neural network to learn whatever this function should be, and trust backpropagation and gradient descent to learn the right function.
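One plausible form for that small network (the parameter names and the tanh hidden layer here are assumptions for illustration, not the video's exact choice) scores every input position against the previous decoder state:

```python
import numpy as np

def attention_scores(s_prev, a, W_s, W_a, v):
    """One-hidden-layer network computing e<t, t'> from s<t-1> and a<t'>.

    s_prev: shape (n_s,), decoder hidden state from the previous step.
    a:      shape (Tx, n_feat), the per-word input features a<t'>.
    W_s, W_a, v: hypothetical trainable parameters of shapes
        (n_h, n_s), (n_h, n_feat), and (n_h,).
    Returns e, shape (Tx,): one unnormalized score per input position.
    """
    hidden = np.tanh(W_s @ s_prev + a @ W_a.T)  # (Tx, n_h) hidden layer
    return hidden @ v                           # one scalar score per t'
```

These scores would then be passed through the softmax normalization to produce the attention weights.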

And it turns out that if you implement this whole model and train it with gradient descent, the whole thing actually works. This little neural network does a pretty decent job of telling you how much attention y<t> should pay to a<t'>, and this formula makes sure that the attention weights sum to one. Then, as you chug along generating one word at a time, this neural network actually pays attention to the right parts of the input sentence, and it learns all of this automatically using gradient descent.

Now, one downside to this algorithm is that it takes quadratic time, or quadratic cost, to run. If you have Tx words in the input and Ty words in the output, then the total number of these attention parameters is going to be Tx times Ty, and so this algorithm runs in quadratic cost. Although in machine translation applications, where neither the input nor the output sentence is usually that long, maybe quadratic cost is actually acceptable. There is also some research work on trying to reduce this cost.

Now, so far I've been describing the attention idea in the context of machine translation. Without going too much into detail, this idea has been applied to other problems as well, such as image captioning. In the image captioning problem, the task is to look at a picture and write a caption for that picture. So in the paper cited at the bottom, by Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio, they showed that you could have a very similar architecture look at a picture and pay attention to only parts of the picture at a time while writing a caption for the picture. So if you're interested, I encourage you to take a look at that paper as well.

And you get to play with all of this and more in the programming exercise. Whereas machine translation is a very complicated problem, in this week's exercise you get to implement and play with the attention idea yourself for the date normalization problem. So the problem is to input a date like this (this is actually the date of the Apollo Moon landing) and normalize it into a standard format, or to take a date like this and have a neural network, a sequence-to-sequence model, normalize it to this format. This, by the way, is the birthday of William Shakespeare; at least, it's believed to be.

And what you'll see in the programming exercise is that you can train a neural network to input dates in any of these formats and have it use an attention model to generate a normalized format for these dates.

One other thing that's sometimes fun to do is to look at visualizations of the attention weights. So here's a machine translation example, where we've plotted in different colors the magnitudes of the different attention weights. I don't want to spend too much time on this, but you'll find that for corresponding input and output words, the attention weights tend to be high. This suggests that, when it's generating a specific word in the output, it is usually paying attention to the correct words in the input, and all of this, including learning where to pay attention, was learned using backpropagation with an attention model.

So that's it for the attention model, really one of the most powerful ideas in deep learning. I hope you enjoy implementing and playing with these ideas yourself later in this week's programming exercises.