So let's take a look at tokenizing and padding a lot more data. Instead of just the few sentences we were looking at earlier on, we're actually going to use a full data set, and this is the sarcasm data set that we were talking about in the lesson. So here I'm just going to wget that sarcasm data set and download it to /tmp/sarcasm.json. Once I've downloaded it, I'm going to create three lists: one for the sentences, one for the labels, and one for the URLs. I'm not actually going to use the URLs here, but if you want to use them yourself, here's how you do it. Because the data is JSON, it's really easy for me to iterate through the data store: for each item I can just add the headline to sentences, add is_sarcastic to labels, and add article_link to urls.

Now that my data is ready, I'm going to import my Tokenizer and pad_sequences as before. I'm going to create the tokenizer, specifying my out-of-vocabulary token, and call fit_on_texts on all of the sentences. Let's actually run it and see what happens. We can see it downloading the data, and now we have a much larger vocabulary: there are 29,657 words in the index. We can start seeing things like the out-of-vocabulary token is 1, "to" is number 2, "of" is number 3, "the" is number 4. Remember, these are sorted in order of how common they are, so you should see basic words like "the" and "for" pretty high up in the list. Once I have my word index, I'm printing out its length, which is what gives me the 29,657, and then printing the word index itself.

Now on my tokenizer I can call texts_to_sequences as before, passing in my sentences to turn them into sequences. In the last screencast we used pad_sequences with the default padding, which is pre, so everything was prefixed with zeros. This time, passing padding='post' puts the zeros after the sentence instead, so the sentences are post-padded.

If I look, for example, at sentence number two in the corpus and at what its padded version looks like, we'll see that sentence number two is "mom starting to fear son's web series is the closest thing she will have to a grandchild", and we can see the tokens for those actual words in here. For example, "mom starting to" — we can see that the word "to" is token number 2. Are there any others that are pretty high up in the list? Take token 39, for example: if I go through my list and see what 39 is, we'll see that the word "will" is number 39. So coming back here, in "she will have to a grandchild", "will" is number 39.

And then finally, the shape of the padded array shows that each of the sentences in the data set has been padded out to 40 words long, and there are 26,709 sentences in the data set, so the shape of my padded array is (26709, 40). This is what could then be used to train a neural network with embeddings, which you'll be seeing next week.
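
As a rough sketch of the code being walked through here (the download URL is omitted in the transcript, and the local path /tmp/sarcasm.json plus the "<OOV>" token string are assumptions about how the notebook is set up):

```python
import json

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# The data set is fetched with wget in the screencast, e.g.
#   !wget <sarcasm dataset url> -O /tmp/sarcasm.json
# (exact URL not shown in the transcript).

# Load the JSON and split it into the three lists.
with open("/tmp/sarcasm.json", "r") as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])
    urls.append(item["article_link"])

# Tokenize with an out-of-vocabulary token and fit on all sentences.
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Turn the sentences into sequences and post-pad them with zeros.
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding="post")

print(len(word_index))  # 29,657 words in the index
print(sentences[2])     # sentence number two in the corpus
print(padded[2])        # its token sequence, padded with trailing zeros
print(padded.shape)     # (26709, 40): 26,709 sentences, each 40 tokens long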