Hi! During this week,

you have already learnt about traditional NLP methods for

such tasks as language modeling, part-of-speech tagging, and named-entity recognition.

So in this lesson,

we are going to cover the same tasks but with neural networks.

So neural networks are a very strong technique,

and they now give state-of-the-art performance for these kinds of tasks.

So please stay with me for this lesson.

This is just the recap of what we have for language modeling.

So the task is to predict the next word,

given some previous words,

and we know that, for example,

with a 4-gram language model,

we can do this just by counting the n-grams and normalizing them.
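
As a minimal sketch of what counting and normalizing means here: the toy corpus and the function name `train_ngram_lm` are made up for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_ngram_lm(tokens, n=4):
    """Estimate P(word | previous n-1 words) by counting and normalizing."""
    context_counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])   # the n-1 previous words
        word = tokens[i + n - 1]               # the word to predict
        context_counts[context][word] += 1
    # Normalize the counts of each context into probabilities.
    model = {}
    for context, counter in context_counts.items():
        total = sum(counter.values())
        model[context] = {w: c / total for w, c in counter.items()}
    return model

# Toy corpus, invented for this example.
corpus = "have a good day . have a good day . have a good time .".split()
lm = train_ngram_lm(corpus, n=4)
probs = lm[("have", "a", "good")]   # distribution over the next word
```

A context that never occurs in the data, such as `("have", "a", "great")`, simply has no entry in `lm`.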

Now, let us take a closer look and let us discuss a very important problem here.

Imagine that you have some data,

and you have some similar words in this data like good and great here.

In our current model,

we treat these words just as separate items.

So for us, they are just separate indices in

the vocabulary. Or, let us say this in terms of neural language models.

You have one-hot encoding,

which means that you encode your words with a long,

long vector of the vocabulary size,

and you have zeros in this vector and just one non-zero element,

which corresponds to the index of the word.

So this encoding is not very nice.

Why? Imagine that you see "have a good day" a lot of times in your data,

but you have never seen "have a great day".

So if you could understand that good and great are similar,

you could probably estimate some very good probabilities

for "have a great day" even though you have never seen this.

Just by saying okay,

maybe "have a great day" behaves

exactly the same way as "have a good day" because they're similar,

but if you treat the words independently,

you cannot do this.
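
The limitation of one-hot encoding described above can be seen in a few lines. This is only an illustrative sketch; the tiny vocabulary and the helper `one_hot` are assumptions made for the example.

```python
def one_hot(word, vocab):
    """Return a vector of vocabulary size with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["a", "day", "dog", "good", "great", "have"]  # toy vocabulary
good = one_hot("good", vocab)
great = one_hot("great", vocab)

# The dot product of two different one-hot vectors is always 0,
# so the encoding itself says nothing about which words are similar.
dot = sum(g * h for g, h in zip(good, great))
```

Here "good" is exactly as far from "great" as it is from "dog", which is why nothing learned about one word transfers to the other.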

I want you to realize that it is really

a huge problem because language is highly variable.

Just another example, let us say we have lots of breeds of dogs,

you can never assume that you have all these breeds of dogs in your data,

but maybe you have dog in your data.

So if you just know that they are somehow similar,

you can know how some particular types of dogs occur in data just

by transferring your knowledge from dogs.

Great.

What can we do about it?

Well, the approach is called distributed representations,

and this is exactly about fixing this problem.

So now, we are going to represent our words with low-dimensional vectors.

So that dimension will be m,

something like 300 or maybe 1000 at most,

and these vectors will be dense.

And we are going to learn these vectors.

Importantly, we will hope that similar words will have similar vectors.

For example, good and great will be similar,

and dog will not be similar to them.

I ask you to remember this notation at the bottom of the slide,

so the C matrix will be built from these vector representations,

and each row will correspond to some word.
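
To make the C matrix concrete, here is a small sketch. The numbers are invented by hand just to show the idea (in a real model they are learned), and the helpers `embed` and `cosine` are names chosen for this example.

```python
import math

vocab = ["good", "great", "dog"]   # toy vocabulary
# Hypothetical matrix C: one dense m-dimensional row per word (m = 3 here).
# These values are made up; a real model would learn them from data.
C = [
    [0.9, 0.8, 0.1],   # good
    [0.8, 0.9, 0.2],   # great
    [0.1, 0.2, 0.9],   # dog
]

def embed(word):
    """Look up the row of C that corresponds to a word."""
    return C[vocab.index(word)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_good_great = cosine(embed("good"), embed("great"))
sim_good_dog = cosine(embed("good"), embed("dog"))
```

With these hand-picked rows, "good" and "great" come out much closer to each other than "good" and "dog", which is the behavior we hope training will produce.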

So we are going to define a probabilistic model of

data using these distributed representations.

And we are going to learn lots of parameters including these distributed representations.

This is the model that tries to do this.

Actually, this is a very famous model from 2003 by Bengio,

and this model is one of the first neural probabilistic language models.

So this slide may not be very understandable for you.

That's okay. I just want you to get the idea of the big picture.

So you have your words at the bottom,

and you feed them to your neural network.

So first, you encode them with the C matrix,

then some computations occur,

and after that, you have a long y vector at the top of the slide.

So this vector has as many elements as words in the vocabulary,

and every element corresponds to the probability of a certain word in your model.

Now, let us go into more detail,

and let us see what the formulas are for the bottom,

the middle, and the top part of this neural network.

Looks scary, doesn't it? Don't be scared.

I will break it down for you.

So the last thing that we do in our neural network is softmax.

We apply it to the components of the y vector.

The y vector is as long as the size of the vocabulary,

which means that we will get some probabilities normalized over words in the vocabulary,

and that's what we need.
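
The softmax step can be sketched like this. The scores are arbitrary numbers chosen for illustration; subtracting the maximum is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(y):
    """Turn a vector of scores into probabilities that sum to 1."""
    shift = max(y)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - shift) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # one score per word in a 3-word toy vocabulary
probs = softmax(scores)
```

Larger scores get larger probabilities, and the output has one entry per word in the vocabulary.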

What happens in the middle of our neural network?

There are some huge computations here with lots of parameters.

Actually, every letter in this line is a parameter,

either a matrix or a vector.

The only letter which is not a parameter is x.

So what is x?

X is the representation of our context.

You remember our C matrix,

which just holds the distributed representations of words.

So you take the representations of all the words in your context,

and you concatenate them, and you get x.

So just once again from bottom to the top this time.

You get your context representation.

You feed it to your neural network to compute

y and you normalize it to get probabilities.
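
Here is a rough sketch of that bottom-to-top pass. It assumes the middle part is the formula from the original 2003 paper, y = b + Wx + U tanh(d + Hx); the sizes, the random initialization, and the helper names are all made up for illustration, and no training is shown.

```python
import math
import random

random.seed(0)
V, m, n, h = 5, 3, 4, 6   # vocab size, embedding dim, n-gram order, hidden units

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(y):
    shift = max(y)
    exps = [math.exp(t - shift) for t in y]
    total = sum(exps)
    return [e / total for e in exps]

# Parameters (random here; learned in a real model).
C = rand_matrix(V, m)               # word representations, one row per word
H = rand_matrix(h, m * (n - 1))     # input-to-hidden weights
d = [0.0] * h                       # hidden bias
U = rand_matrix(V, h)               # hidden-to-output weights
W = rand_matrix(V, m * (n - 1))     # direct input-to-output weights
b = [0.0] * V                       # output bias

def forward(context_ids):
    # Bottom: concatenate the C rows of the n-1 context words to get x.
    x = [value for idx in context_ids for value in C[idx]]
    # Middle (assumed formula): y = b + W x + U tanh(d + H x).
    hidden = [math.tanh(di + hi) for di, hi in zip(d, matvec(H, x))]
    y = [bi + wi + ui for bi, wi, ui in zip(b, matvec(W, x), matvec(U, hidden))]
    # Top: normalize y into probabilities over the vocabulary.
    return softmax(y)

probs = forward([0, 3, 1])   # indices of the three context words
```

The result `probs` has one probability per vocabulary word, so the most likely next word is just the index of its largest entry.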

Now, to check that we understand everything,

it's always very good to try to understand the dimensions of all the matrices here.

For example, what is the dimension of the W matrix?

Well, we can write it down like that,

and we can see that what we want to get as the result of this formula

has the dimension of the size of the vocabulary.

Now what is the dimension of x?

Well, x is the concatenation of

m-dimensional representations of n minus 1 words from the context.

So it is m multiplied by n minus 1.

Here you go. You can see the dimension of the W matrix.
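
Written out, the dimension check looks like this, where |V| is my notation for the vocabulary size, which the lecture only names in words:

```latex
x \in \mathbb{R}^{m(n-1)}, \qquad
W x \in \mathbb{R}^{|V|}
\quad\Longrightarrow\quad
W \in \mathbb{R}^{|V| \times m(n-1)}
```

For example, with m = 300, n = 4, and a vocabulary of 10,000 words, W would have 10,000 rows and 900 columns.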

So this neural network is great,

but it is kind of over-complicated.

So you can see that you have some non-linearities here,

and it can be really time-consuming to compute this.

So the next slide is about a model which is simpler.

Let's try to understand this one.

It is called the log-bilinear language model.

Maybe it doesn't look like something simpler, but it is.

So let us figure out what happens here.

You still have some softmax,

so you still produce some probabilities,

but you have some other values to normalize.

So you have some bias term b,

which is not important now.

The important part is the multiplication

of word representation and context representation.

Let's figure out what they are.

So the word representation is easy.

It's just the row of your C matrix.

What is the context representation?

You still get your rows of the C matrix to represent individual words in the context,

but then you multiply them by Wk matrices,

and these matrices are different for different positions in the context.

So it's actually a nice model.

It is not a bag-of-words model.

It tries to capture somehow that words that come right before your target word can

influence the probability in some other way than

those words that are somewhere far away in the history.

So you get your word representation and context representation.

And then you just take their dot product to compute the similarity,

and you normalize this similarity.

So the model is very intuitive.

It predicts those words that are similar to the context.
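
The log-bilinear model described above can be sketched as follows. Summing the position-specific projections to get the context representation is my assumption, based on the standard formulation of this model; the sizes and random parameters are invented for illustration.

```python
import math
import random

random.seed(0)
V, m, n = 5, 3, 4   # vocab size, embedding dim, n-gram order

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

C = rand_matrix(V, m)                           # word representations
W = [rand_matrix(m, m) for _ in range(n - 1)]   # one W_k per context position
b = [0.0] * V                                   # bias term, one per word

def forward(context_ids):
    # Context representation: project each context word's C row with the
    # matrix W_k for its position, then sum (summing is an assumption).
    context_repr = [0.0] * m
    for W_k, idx in zip(W, context_ids):
        projected = matvec(W_k, C[idx])
        context_repr = [a + p for a, p in zip(context_repr, projected)]
    # Similarity of every candidate word to the context: dot product plus bias.
    scores = [sum(c * r for c, r in zip(C[w], context_repr)) + b[w]
              for w in range(V)]
    # Softmax normalizes the similarities into probabilities.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward([0, 3, 1])   # indices of the three context words
```

Because each position has its own W_k, reordering the context words will in general change the output, which is what makes this different from a bag-of-words model. There is also no tanh layer here, only matrix products and one softmax.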

Great. This is all for feedforward neural networks for language modeling.

The next video is about recurrent neural networks.

So see you there.