So, the core idea here is that you want words that have similar neighbors, similar contexts, to end up with similar vectors in this new representation. Now, let's see how we achieve that. But before we do, let's cover some of the math of how to represent words efficiently. Say you have a word. To feed it into TensorFlow, you'd typically represent it as a number, for example, the ID of this word in your dictionary. The way you usually use this word in your pipeline is you take a one-hot vector, a vector the size of your dictionary that has only one nonzero value, and push it through a linear model, a neural network, or something similar. The only problem is, this is very inefficient. You take this one-hot vector and multiply it by a weight vector, or a weight matrix, and that's a wasteful process, because most of your weights get multiplied by zeros. Now, you could compute this weighted sum much more efficiently. If you look slightly closer, you could actually write down the answer without any sums or multiplications at all. Could you do that? Yeah, exactly. You just take the one weight corresponding to the nonzero position, weight ID 1337 here, and that weight equals the whole product, because everything else is zero. The same approach works when you have multiple neurons. So let's say that instead of a vector, we now multiply by a matrix. In this case you could again deduce the result by thinking of the matrix product as just a lot of vector products stacked together. Now, how do you compute the output activation vector of your dense layer with this one-hot representation? Yeah, exactly.
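As a quick sketch of the single-neuron case (not from the lecture itself; the vocabulary size is made up, and the word ID 1337 is the one mentioned above), a one-hot dot product in NumPy collapses to reading out a single weight:

```python
import numpy as np

vocab_size = 10_000
rng = np.random.default_rng(0)
w = rng.standard_normal(vocab_size)  # weight vector of a single linear unit

word_id = 1337
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Naive computation: a full dot product, mostly multiplications by zero.
naive = one_hot @ w

# Efficient computation: just read out the one weight that matters.
fast = w[word_id]

assert np.isclose(naive, fast)
```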
So you take the first column vector, and of its product with the one-hot input, the only thing remaining is the element under ID 1337. Then take the second column vector, and its corresponding element becomes your second activation. Then the third, the fourth, and so on, with as many vectors as you have hidden units. And if you collect all the remaining values, they form exactly one row of this matrix. Basically, this says you can replace this large matrix multiplication by simply taking one row, and this, of course, speeds up your computation a lot.

Now let's finally get back to word2vec. Remember, we want to train vector representations so that words with similar contexts get similar vectors. The general idea is this: we define a model which, much like an autoencoder, trains the representation we want as a byproduct of the model training. What it actually tries to do is predict a word's context. So it has only one input, the word, say the word "liked", and it predicts, for every other word, the probability of that word being a neighbor of the input word. Basically, you would expect this model to output large probabilities for words that co-occur with "liked", like "restaurant", for example, and small probabilities for words that don't. Now, the problem here is that this is an under-defined machine learning task, so you can't actually predict the context perfectly. But we don't even need that. We don't expect our model to predict the probabilities ideally; imperfect predictions are fine, because we only need this model to obtain the first matrix, the one on the left here. What this matrix actually does is take the one-hot representation of one word and multiply it by a matrix of weights.
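The matrix version of the same trick can be sketched like this (again just an illustration; the vocabulary size and hidden-layer width are made-up numbers): multiplying a one-hot vector by a weight matrix is identical to selecting one row of that matrix, which is exactly what an embedding lookup does.

```python
import numpy as np

vocab_size, hidden_units = 10_000, 64
rng = np.random.default_rng(1)
W = rng.standard_normal((vocab_size, hidden_units))  # one row per word

word_id = 1337
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Full multiplication vs. taking one row: the results are identical,
# but the row lookup skips all the multiplications by zero.
assert np.allclose(one_hot @ W, W[word_id])
```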
Since we already know this multiplication can be simplified, the idea is that you have this matrix, and for each word in your mini-batch you take the corresponding row of the matrix and send it forward through the network. The second layer takes this representation, this word vector, let's not be afraid of that term, and tries to predict the neighboring words via a dense layer, basically an affine transformation. Now, it makes sense that if two words appear in similar contexts, it's beneficial for the model to assign them similar vectors, so that the second matrix can map them into similar contexts automatically. And if two words have clearly distinguishable contexts, if they are quite different, then your word vectors have to be different, so that the second matrix can produce different results. You train this model by simply taking samples from the dataset, sentences, basically, then picking one word out of a sentence and using it as the input; this is your middle word here. All the other words in the sentence are treated as the target, the reference answer, and the model tries to predict them. And once it converges, you can more or less count on this first matrix being the word representation you actually want.

Now, this is not the only way to train word2vec. Another popular variation is to flip the whole thing: take all the words but one as input and try to predict the missing word. This is called the continuous bag-of-words, or CBOW, model, while the first one, which takes one word and predicts everything around it, is the skip-gram model. So skip-gram takes one word and predicts everything else; CBOW takes every word but one and tries to restore the missing one. These two models are essentially symmetric, and they learn similar representations up to some minor changes, so the general idea stays the same.
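The word-to-context training loop described above can be sketched in a few lines of NumPy. This is a toy illustration, not the lecture's actual code: the vocabulary size, embedding dimension, learning rate, and the single (word, context) training pair are all made up. It shows the two matrices, the row-lookup first layer, the softmax second layer, and one-pair gradient descent on the cross-entropy loss.

```python
import numpy as np

vocab_size, dim = 50, 8
rng = np.random.default_rng(2)
W_in = 0.1 * rng.standard_normal((vocab_size, dim))   # rows become the word vectors
W_out = 0.1 * rng.standard_normal((dim, vocab_size))  # predicts context words

word_id, context_id, lr = 3, 7, 0.5  # hypothetical training pair

def context_probs(word_id):
    logits = W_in[word_id] @ W_out     # one-hot input collapses to a row lookup
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()             # softmax over the whole vocabulary

before = context_probs(word_id)[context_id]

for _ in range(100):
    h = W_in[word_id]
    probs = context_probs(word_id)
    grad_logits = probs.copy()
    grad_logits[context_id] -= 1.0     # gradient of cross-entropy w.r.t. logits
    grad_h = W_out @ grad_logits       # backprop into the word vector
    W_out -= lr * np.outer(h, grad_logits)
    W_in[word_id] -= lr * grad_h

after = context_probs(word_id)[context_id]
assert after > before  # the model now favors the observed neighbor
```

After training on real text, `W_in` is the matrix you keep as your word embedding; `W_out` is discarded.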
And again, you can use one of those two matrices as your word embedding matrix, because, for example, in the word-to-context model, for every training sample only one row of the first matrix was used at a time. So basically, you can assume this row of the matrix is the vector corresponding to your word, and use the matrix as your word embedding. If you train this model yourself, or use a pretrained one, you'll notice it has a lot of peculiar properties on top of what we actually wanted it to have. Of course, it does what we actually trained it for: it learns similar vectors for synonyms and different vectors for semantically different words. But there's also a very peculiar effect, a kind of word algebra. For example, if you take the vector of "king", subtract from it the vector of "man", and add the vector of "woman", you get something very close to the vector of "queen". So, king minus man plus woman equals queen. Or, another example, moscow minus russia plus france equals paris. These identities kind of make sense, although they're not well defined in mathematical terms, and they're just a side effect of the model's training. So, again, like other models we've studied previously, this is not a desired, originally intended effect, but it's very interesting, and sometimes it's even helpful for applications of these word embedding models. Now, if you visualize those word vectors, for example by taking the first two principal components of your trained vectors, it also emerges that this linear algebra extends very nicely to the structure of the embedding space. In many cases, you may expect a similar direction vector connecting all countries to their corresponding capitals, or all male profession names to the corresponding female profession names.
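The word-algebra effect can be demonstrated with a toy example (the 2-d vectors below are invented for illustration, one axis loosely encoding "royalty" and the other "gender"; real embeddings are learned and much higher-dimensional): analogy queries are answered by vector arithmetic followed by a nearest-neighbor search, excluding the query words themselves.

```python
import numpy as np

# Toy hand-crafted 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "paris": np.array([5.0,  0.0]),
}

def nearest(vec, exclude):
    """Return the vocabulary word whose vector has the highest cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # → queen
```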
So there are a lot of those peculiar properties. Of course, you cannot expect them to emerge with 100% certainty: sometimes you get the desired effects, sometimes you just get rubbish. And the model isn't strictly required to preserve those exact distances; it just learns something peculiar along the way. This coincides with the idea that autoencoders and other unsupervised learning methods tend to have a lot of unexpected properties that they all satisfy. So hopefully by now I've managed to convince you that having those word vectors around is really convenient, or at least cool, because they have all those nice properties. It will later turn out that word vectors are really crucial for some other deep learning applications to natural language processing, like recurrent neural networks. But before we cover that, let's find out how we train them, how we obtain those vectors to start collecting the benefits from them.

Now, the engineering part of the problem is both simple and really complicated. The simple part is data. You can take any text collection you want: poetry, Wikipedia, or all the news you can gather before your account gets banned. This comes from the fact that word2vec can be trained on any sort of text without any labels. The second part, which gets really complicated, is making this model compute and train efficiently. The problem is that the two large matrix multiplications here are really asymmetric in terms of complexity. The first one, which takes a one-hot vector and multiplies it by a matrix, can be, as you already know, replaced by just taking one row of the matrix. But the second one doesn't have this property, because it operates on a dense vector.
If you compute this thing naively, you're facing a matrix multiplication on the scale of, say, a 100-dimensional vector by 10 to the power of 4 or 5 possible words, which is a lot of computation for a model that only has two layers in it. And the hardest part is that you cannot cheat by computing only a partial output of this layer, because this second layer tries to predict probabilities, and when you say probability in deep learning, you actually mean softmax. The problem with softmax is that to compute the probability of just one class, you have to exponentiate the logit for that class and then divide it by the sum of exponentiated logits for all possible classes, including this one. And this second, normalization part is really hard, because you have to add up the unnormalized probabilities, the exponentiated logits, from all classes just to compute one output. Now, okay, you could of course do this. There is enough memory on modern GPUs, and it's even feasible on CPUs. But it's a very simple operation that eats up a lot of compute. So instead there are special modifications of softmax, like hierarchical softmax or sampled softmax, which try to estimate this thing more efficiently, sacrificing either some of the mathematical properties or the fact that softmax gives deterministic probabilities, so just adding some noise. There's also a number of similar models that try to avoid computing probabilities, avoiding softmax altogether, like GloVe, Global Vectors, which has no such nonlinearity in its pipeline.

Now, finally, word embeddings can be extended to higher-level representations. You can find embeddings for other objects; for example, you can find embeddings for an entire sentence, which makes the method kind of hierarchical.
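The softmax bottleneck described above can be made concrete with a small sketch (the dimensions are the rough scale mentioned in the lecture, scaled down for the example): even to read off the probability of a single word, the denominator forces you to compute logits for the entire vocabulary.

```python
import numpy as np

vocab_size, dim = 10_000, 100
rng = np.random.default_rng(3)
h = rng.standard_normal(dim)                    # word vector from the first layer
W_out = rng.standard_normal((dim, vocab_size))  # the expensive second layer

# To get the probability of one single word (say, ID 42), the softmax
# denominator still requires logits for every word in the vocabulary:
logits = h @ W_out                              # dim x vocab_size multiplication
p_one = np.exp(logits[42]) / np.exp(logits).sum()

assert 0.0 < p_one < 1.0  # one probability, at the cost of the full layer
```

Hierarchical and sampled softmax exist precisely to approximate that denominator without touching all `vocab_size` columns.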
Or you could try to find embeddings for domain-specific data, like amino acids. From bioinformatics, I know a model called protein2vec that tries to vectorize protein components. This is a more or less advanced part of natural language processing. We'll add links describing it in the readings section, but you can more or less expect it to be covered in more detail in the natural-language-oriented course in our specialization. So basically, to be continued: if you're intrigued by this, you can jump to the reading section before the NLP course starts. This concludes the part of today's lecture dedicated to natural language, and word embeddings in particular. But don't worry, besides the reading section, we'll also have the entire next week dedicated to advanced applications of deep learning to natural language processing. We'll study recurrent neural networks that can, when paired with word embeddings of course, solve not only text classification tasks like sentiment analysis, but also the inverse problem, generating text given a particular task. This, of course, fits very well with your course project, which is generating text captions for images. See you in the next section.