To build such a model using an RNN
you would first need a training set comprising a large corpus of English text,
or text from whatever language you want to build a language model of.
The word corpus is NLP terminology that just means a large body,
a very large set, of English text or English sentences.
So let's say you get a sentence in your training set as follows.
Cats average 15 hours of sleep a day.
The first thing you would do is tokenize this sentence.
And that means you would form a vocabulary as we saw in an earlier video.
And then map each of these words to, say, one-hot vectors,
or to indices in your vocabulary.
One thing you might also want to do is model when sentences end.
So another common thing to do is to add an extra token called EOS.
That stands for End Of Sentence, and it can help you figure out when a sentence ends.
We'll talk more about this later,
but the EOS token can be appended to the end of every sentence in your training
set if you want your model to explicitly capture when sentences end.
We won't use the end-of-sentence token for the programming exercise at the end
of this week, but for some applications you might want to use it.
And we'll see later where this comes in handy.
So in this example, we have y<1>, y<2>, y<3>, and so on up through y<9>.
Nine inputs in this example if you append the end-of-sentence token to the end.
And in doing the tokenization step, you can decide whether or
not the period should be a token as well.
In this example, I'm just ignoring punctuation,
so I'm just using day as another token
and omitting the period. If you want to treat the period or other punctuation
as explicit tokens, then you can add the period to your vocabulary as well.
Now, one other detail would be: what if some of the words in your training set
are not in your vocabulary?
So if your vocabulary uses 10,000 words, maybe the 10,000 most common
words in English, then the term Mau, as in the Egyptian Mau, which is a breed of cat,
might not be one of your top 10,000 tokens.
So in that case you could take the word Mau and replace it with a unique
token called UNK, which stands for unknown word, and you would just model
the chance of the unknown word instead of the specific word Mau.
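To make the tokenization step concrete, here is a minimal sketch in Python. The tiny vocabulary, the `<EOS>` and `<UNK>` token names, and the `tokenize` helper are all illustrative assumptions for this example, not code from the course.

```python
# Illustrative sketch of tokenization with <EOS> and <UNK> tokens.
# The vocabulary and helper function names are assumptions, not course code.

def tokenize(sentence, word_to_index, eos="<EOS>", unk="<UNK>"):
    """Map each word to its vocabulary index, appending an end-of-sentence
    token and replacing out-of-vocabulary words with the unknown token."""
    tokens = sentence.lower().rstrip(".").split() + [eos]
    return [word_to_index.get(w, word_to_index[unk]) for w in tokens]

# Tiny stand-in vocabulary (a real one might hold the 10,000 most common words).
vocab = ["cats", "average", "15", "hours", "of", "sleep", "a", "day",
         "<EOS>", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

print(tokenize("Cats average 15 hours of sleep a day.", word_to_index))
# "mau" is not in the vocabulary, so every unknown word maps to the <UNK> index.
print(tokenize("The Egyptian Mau sleeps.", word_to_index))
```

Note how the running example maps to nine token indices once the end-of-sentence token is appended.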
Having carried out the tokenization step, which basically means
taking the input sentence and mapping it to the individual tokens or
the individual words in your vocabulary,
Next let's build an RNN to model the chance of these different sequences.
And one of the things we'll see on the next slide is that you end
up setting the input x<t> = y<t-1>; you'll see that in a little bit.
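That input/target relationship can be sketched with a few lines of Python. The `make_inputs` helper and the zero start placeholder are illustrative assumptions, not the course's code.

```python
# Sketch of the training-time input/target shift: each input x<t> is the
# previous target y<t-1>, with x<1> a zero (start) placeholder.
# Token values here are made-up vocabulary indices.

def make_inputs(targets, start_token=0):
    """Given target tokens y<1>..y<T>, build inputs x<1>..x<T> where
    x<1> is the start placeholder and x<t> = y<t-1> for t > 1."""
    return [start_token] + targets[:-1]

targets = [12, 47, 3, 8]       # y<1>..y<4>, e.g. "cats average ... <EOS>"
inputs = make_inputs(targets)  # x<1>..x<4>
print(list(zip(inputs, targets)))
```

At each step the network sees the correct previous word as input and is trained to predict the current word.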
So let's go on to build the RNN model, and
I'm going to continue to use this sentence as the running example.
This will be an RNN architecture.
At time 1, you're going to end up computing some
activation a<1> as a function of some input x<1>, and
x<1> will just be set to a vector of all zeros, the zero vector.
And the previous activation a<0>, by convention, is also set to a vector of zeros.
But what a<1> does is it will make a softmax prediction to
try to figure out what is the probability of the first word y.
And so that's going to be y hat <1>.
So what this step does is really, it has a softmax and it's trying to predict
what is the probability of any word in the dictionary.
What's the chance that the first word is a?
What's the chance that the first word is Aaron?
And then what's the chance that the first word is cats?
All the way to what's the chance that the first word is Zulu?
Or what's the chance that the first word is the unknown word?
Or what's the chance that the first word is the end-of-sentence token?
Right, so y hat <1> is the output of a softmax; it just predicts what's
the chance of the first word being whatever it ends up being.
And in our example, it winds up being the word cats.
So this would be a 10,000-way softmax output if you have a 10,000-word vocabulary.
Or 10,002, I guess, if you count the unknown word and
the end-of-sentence token as two additional tokens.
Then the RNN steps forward to the next step and
passes some activation, a<1>, to the next step.
And at this step, its job is to try to figure out what is the second word.
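The first two steps can be sketched numerically with NumPy. This follows the standard RNN formulation, a<t> = tanh(Waa a<t-1> + Wax x<t> + ba) and y hat <t> = softmax(Wya a<t> + by); the dimensions, random weights, and the choice of "cats" as vocabulary index 0 are toy assumptions, not the course's exact code.

```python
# Numerical sketch of the first two steps of an RNN language model.
# Weights are random toy values; sizes assume a 10,000-word vocabulary
# plus <UNK> and <EOS> (10,002 softmax outputs).
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 10_002, 16

Wax = rng.standard_normal((hidden, vocab_size)) * 0.01
Waa = rng.standard_normal((hidden, hidden)) * 0.01
Wya = rng.standard_normal((vocab_size, hidden)) * 0.01
ba, by = np.zeros((hidden, 1)), np.zeros((vocab_size, 1))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

a0 = np.zeros((hidden, 1))      # previous activation: zeros by convention
x1 = np.zeros((vocab_size, 1))  # first input: the zero vector
a1 = np.tanh(Waa @ a0 + Wax @ x1 + ba)
y1_hat = softmax(Wya @ a1 + by)  # P(first word), one entry per vocab token

# Second step: the input is the (one-hot) first word, here assumed to be
# "cats" at illustrative index 0.
x2 = np.zeros((vocab_size, 1)); x2[0] = 1.0
a2 = np.tanh(Waa @ a1 + Wax @ x2 + ba)
y2_hat = softmax(Wya @ a2 + by)  # P(second word | "cats")

print(y1_hat.shape)  # one probability for each of the 10,002 tokens
```

Each output is a full 10,002-way distribution over the vocabulary, which is exactly the softmax described above.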