0:30

So in this simple setup, we are interested in analyzing one document and trying to discover just one topic. So this is the simplest case of a topic model.

The input no longer has k, the number of topics, because we know there is only one topic, and the collection has only one document as well.

In the output, we also no longer have coverage, because we assume that the document covers this topic 100%. So the main goal is just to discover the word probabilities for this single topic, as shown here.

1:14

As always, when we think about using a generative model to solve such a problem, we start by thinking about what kind of data we are going to model, or from what perspective we're going to model the data, that is, the data representation.

Then we design a specific model for the generation of the data from our perspective, where "our perspective" just means we take a particular angle of looking at the data, so that the model will have the right parameters for discovering the knowledge that we want.

And then we'll think about the likelihood function, or write down the likelihood function, to capture more formally how likely a data point would be obtained from this model.

2:05

And the likelihood function will have some parameters in it. We are then interested in estimating those parameters, for example by maximizing the likelihood, which leads to the maximum likelihood estimator.

These estimated parameters will then become the output of the mining algorithm, which means we'll take the estimated parameters as the knowledge that we discover from the text.

So let's look at these steps for this very simple case. Later we'll look at this procedure for some more complicated cases.

So our data in this case is just a document, which is a sequence of words. Each word here is denoted by x sub i.

Our model is a unigram language model, a word distribution that we hope will denote a topic, and that's our goal. So we will have as many parameters as words in our vocabulary, in this case M.

4:15

Now when we do this transformation, we need to introduce a count function here. This denotes the count of word w_1 in the document, and similarly this is the count of word w_M in the document, because these words might have repeated occurrences.

You can also see that if a word did not occur in the document,

4:41

it will have a zero count, and therefore the corresponding term will disappear.

So this is a very useful form of writing down the likelihood function that we will often use later. I want you to pay attention to it and get familiar with the notation: it just changes the product to be over all the distinct words in the vocabulary.

In the end, of course, we'll use theta sub i to express this likelihood function, and it will look like this. Next, we're going to find the theta values, the probabilities of these words, that maximize this likelihood function.
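To make this count form concrete, here is a minimal sketch in Python (the function and variable names are mine, not from the lecture): the likelihood in log form is just the sum over words of count times log probability, so zero-count words drop out automatically.

```python
from collections import Counter
import math

def log_likelihood(doc_words, theta):
    """Log-likelihood of a document under a unigram language model:
    sum over the vocabulary of c(w, d) * log(theta_w).
    Words with zero count contribute nothing, so we only iterate
    over words that actually occur in the document."""
    counts = Counter(doc_words)
    return sum(c * math.log(theta[w]) for w, c in counts.items())

doc = ["text", "mining", "text"]
theta = {"text": 0.5, "mining": 0.5}
# 2 * log(0.5) + 1 * log(0.5), i.e. counts multiply the log probabilities
print(log_likelihood(doc, theta))
```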

So now let's take a look at the maximum likelihood estimation problem more closely.

5:47

We maximize the log-likelihood instead of the original likelihood. This is purely for mathematical convenience, because after the logarithm transformation our function becomes a sum instead of a product. We also have a constraint over these probabilities. The sum makes it easier to take derivatives, which is often needed for finding the optimal solution of such a function.

So please take a look at this sum again, here. This is a form of function that you will often see later as well, in the more general topic models. It's a sum over all the words in the vocabulary, and inside the sum there is the count of a word in the document, multiplied by the logarithm of a probability.

6:58

Now at this point the problem is purely a mathematical one, because we are just going to find the optimal solution of a constrained maximization problem. The objective function is the log-likelihood function, and the constraint is that all these probabilities must sum to one.

So one way to solve the problem is to use the Lagrange multiplier approach.

7:39

So in this approach we construct a Lagrange function, here. This function combines our objective function with another term that encodes our constraint, and we introduce a Lagrange multiplier here, lambda, as an additional parameter.

Now, the idea of this approach is to turn the constrained optimization into, in some sense, an unconstrained optimization problem. We are then just interested in optimizing this Lagrange function.

8:19

As you may recall from calculus, an optimal point would be achieved when the derivative is set to zero. This is a necessary condition, though not a sufficient one.

So if we do that, you will see the partial derivative with respect to theta sub i here is equal to this. This part comes from the derivative of the logarithm function, and this lambda is simply taken from here. When we set it to zero, we can easily see that theta sub i is related to lambda in this way.
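Written out in symbols (using c(w_i, d) for the count of word w_i in document d and M for the vocabulary size, my notation for what the slide shows), the Lagrange function and the derivative step are:

```latex
\Lambda(\theta, \lambda)
  = \sum_{i=1}^{M} c(w_i, d)\,\log\theta_i
  + \lambda \Bigl( \sum_{i=1}^{M} \theta_i - 1 \Bigr)

\frac{\partial \Lambda}{\partial \theta_i}
  = \frac{c(w_i, d)}{\theta_i} + \lambda = 0
  \quad\Longrightarrow\quad
  \theta_i = -\frac{c(w_i, d)}{\lambda}
```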

9:16

And this lambda is just the negative sum of all the counts. This further allows us to solve the optimization problem and eventually find the optimal setting for theta sub i.

If you look at this formula, it turns out to be very intuitive, because it's just each word's count normalized by the document length, which is the sum of the counts of all the words in the document.

So after all this math, we have obtained something very intuitive, and it matches our intuition that to maximize the likelihood of the data we should assign as much probability mass as possible to the observed words here.

You might also notice that this is the general result of a maximum likelihood estimator. In general, the estimator normalizes counts; it's just that sometimes the counts have to be computed in a particular way, as you will also see later.
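As a quick sketch of this closed-form result (my own names, not from the lecture), the estimate really is just normalized counts:

```python
from collections import Counter

def mle_unigram(doc_words):
    """Maximum likelihood estimate of a unigram language model:
    theta_w = c(w, d) / |d|, each word's count normalized by the
    document length."""
    counts = Counter(doc_words)
    total = sum(counts.values())  # |d|, the document length
    return {w: c / total for w, c in counts.items()}

doc = ["text", "mining", "text", "data"]
print(mle_unigram(doc))  # {'text': 0.5, 'mining': 0.25, 'data': 0.25}
```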

So this is basically an analytical solution to our optimization problem. In general, though, when the likelihood function is very complicated, we won't be able to solve the optimization problem with a closed-form formula. Instead we have to use numerical algorithms, and we're going to see such cases later as well.

Â So if you imagine what would we get if we use such a maximum

Â likelihood estimator to estimate one topic for a single document d here?

Â Let's imagine this document is a text mining paper.

Â Now, what you might see is something that looks like this.

Â On the top, you will see the high probability words tend to be those very

Â common words, often functional words in English.

Â And this will be followed by some content words that really

Â characterize the topic well like text, mining, etc.

Â And then in the end, you also see there is more probability of

Â words that are not really related to the topic but

Â they might be extraneously mentioned in the document.

As a topic representation, you can see this is not ideal, right? That's because the high-probability words are function words; they don't really characterize the topic. So my question is: how can we get rid of such common words?
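To see the problem concretely, here is a toy illustration with a made-up sentence standing in for a text mining paper (the sentence and names are mine, not from the lecture): the maximum likelihood estimate gives its highest probability to the function word "the", not to the content words.

```python
from collections import Counter

# A made-up sentence standing in for a "text mining paper".
doc = ("the paper presents the text mining algorithm and "
       "the experiments on the data").split()

counts = Counter(doc)
theta = {w: c / len(doc) for w, c in counts.items()}

# Sort by estimated probability: the function word "the" dominates,
# while "text" and "mining" rank below it.
top = sorted(theta.items(), key=lambda kv: -kv[1])
print(top[0])  # the highest-probability word is the stopword "the"
```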
