Before we presented latent Dirichlet allocation, or LDA for short,

we presented an alternative document clustering model,

where we introduced a set of topic-specific distributions over the words in

the vocabulary, where remember every topic is a different cluster.

And then every document was assigned to a cluster just as before.

But in forming that assignment, the score of the document under the cluster was

computed by just looking at a bag-of-words representation of the document.

So just an unordered set of the words that appear in that document.

And then scoring the words under the specific topic distribution.
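The bag-of-words scoring just described can be sketched in a few lines. This is only an illustration, not the course's implementation; the tiny topic distribution and document below are made up:

```python
from collections import Counter
import math

def score_document(words, topic_word_probs):
    """Log-probability of a document's bag of words under one topic's
    distribution over the vocabulary (word order is ignored)."""
    counts = Counter(words)  # bag of words: just unordered word counts
    return sum(n * math.log(topic_word_probs[w]) for w, n in counts.items())

# Hypothetical topic distribution over a tiny vocabulary
topic = {"game": 0.5, "team": 0.3, "score": 0.2}
doc = ["game", "team", "game", "score"]
log_score = score_document(doc, topic)
```

Because the representation is unordered, any permutation of `doc` gets exactly the same score.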

And here, just like in the mixture models we described previously, every cluster or

topic in this case has a specific prevalence in the overall corpus.

So this is a distribution over topics that appear in the entire corpus.

So in this module, we compared and contrasted this clustering model

with the mixture of Gaussian clustering model we presented in the third module.

And then we turned to the LDA model itself.

Where here, every word in every document

had an assignment variable linking that word to a specific topic.

So then, when we think about scoring a document in LDA,

we think of scoring every word under its associated topic.

Where these topics are defined exactly like in the alternative clustering model we

just described.

Where there's a distribution over every word in

the vocabulary specific to the topic.

But the fact that there's a topic indicator per word in the document,

rather than per document,

is not the only thing that distinguishes this model from the clustering model we

just described.

The other thing is, we introduced this topic proportion vector specific to

each document, rather than representing corpus-wide topic proportions.

And this is really one of the key aspects of LDA,

because this forms our mixed membership representation of every document.

So a document doesn't belong to just one topic,

it belongs to a collection of topics.

And there are different weights on how much membership the document has in each

one of these topics.

And in this module, we described how we can think of

these topic proportions as a learned feature representation.

Where we can use it to do things like allocating this article to

multiple sections on a news website or using it to relate different articles

to one another or using it to learn users' preferences over different topics.
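As a sketch of the "relating articles" use case, the learned topic proportions can be compared directly, for instance with cosine similarity. The article names and proportion vectors here are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical learned topic proportions for three articles
art_a = [0.7, 0.2, 0.1]    # mostly topic 0
art_b = [0.6, 0.3, 0.1]    # a similar mix to art_a
art_c = [0.05, 0.05, 0.9]  # mostly topic 2
```

Under this measure `art_a` relates much more closely to `art_b` than to `art_c`, which is exactly the kind of inference the mixed membership representation supports.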

And likewise,

we talked about how we can think of looking at these topic distributions over

the vocabulary to describe post facto what these topics are really about.

So these are the types of inferences we can draw from LDA.

But the question is how do we learn this structure from data.

And just like in clustering this is a fully unsupervised task,

where we just provide a set of words in a set of documents in the corpus.

And somehow from this we want to extract out these

topic vocabulary distributions and these document topic proportions.

And critical to doing this,

just like in clustering, is inferring the assignments of the words to specific topics.

But in this module, we described that LDA is specified as a Bayesian model.

And so we described a Bayesian inference procedure for

learning our model parameters, as well as these assignment variables.

And the algorithm we described was called Gibbs sampling.

And at first we presented a vanilla version of Gibbs sampling where we simply

iterate between all these assignment variables and model parameters,

randomly reassigning each conditioned on the instantiated values of

all the other parameters or variables.

So at first, we could think about randomly reassigning the topics for

every word in a document.

And then we can think about fixing these and

sampling the topic proportion vector for that specific document.

And then repeating these steps for all documents in the corpus.

And then having fixed these values we can think about resampling the topic

vocabulary distributions.
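The three steps just described can be sketched as one sweep of vanilla Gibbs sampling. This is a simplified illustration under assumed symmetric Dirichlet priors (`alpha`, `gamma`), with made-up variable names, not the course's code:

```python
import random

def sample_categorical(probs):
    """Draw an index k with probability probs[k]."""
    r, acc = random.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if r < acc:
            return k
    return len(probs) - 1

def sample_dirichlet(alphas):
    """Draw a probability vector from a Dirichlet via gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

def vanilla_gibbs_sweep(docs, z, theta, phi, vocab, alpha=1.0, gamma=1.0):
    """One pass of vanilla (uncollapsed) Gibbs sampling for LDA."""
    K = len(phi)
    for d, words in enumerate(docs):
        # Step 1: randomly reassign the topic of every word in the document
        for i, w in enumerate(words):
            probs = [theta[d][k] * phi[k][w] for k in range(K)]
            s = sum(probs)
            z[d][i] = sample_categorical([p / s for p in probs])
        # Step 2: fix the assignments and resample this document's proportions
        counts = [z[d].count(k) for k in range(K)]
        theta[d] = sample_dirichlet([alpha + c for c in counts])
    # Step 3: fix all assignments, then resample each topic's
    # distribution over the vocabulary
    for k in range(K):
        wc = {w: 0 for w in vocab}
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                if z[d][i] == k:
                    wc[w] += 1
        phi[k] = dict(zip(vocab, sample_dirichlet([gamma + wc[w] for w in vocab])))
```

Repeating this sweep many times yields the sequence of Gibbs samples used below.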

But then in the module we described a little bit fancier version

of sampling that we can perform in LDA called collapsed Gibbs sampling.

Where we analytically integrate out over all these model parameters the topic

vocabulary distributions and these document specific topic proportions.

And we just sequentially sample each indicator variable of a given word to

a specific topic conditioned on all the other assignments made in that

document and every other document in the corpus.

And we went through a derivation of the form of this conditional distribution,

specifically there are two terms.

One is how much a given document likes this specific topic,

and the other is how much that topic likes the specific word being considered.

And we said that we multiply those two terms together.

And then we think about renormalizing this

across all possible assignments that we could make.

And then we use that distribution to sample a new

topic indicator for that specific word.

Then we cycle through all words in the document and all documents in the corpus.
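The collapsed update for a single word can be sketched from those two terms. This is a simplified illustration with assumed symmetric priors (`alpha`, `gamma`) and count structures I've named for clarity; it assumes the counts already exclude the word being resampled:

```python
import random

def collapsed_resample(doc_topic_count, topic_word_count, topic_total,
                       d, w, alpha, gamma, V, K):
    """Resample one word's topic indicator, with the model parameters
    (topic proportions and topic-vocabulary distributions) integrated out."""
    weights = []
    for k in range(K):
        # Term 1: how much document d likes topic k
        doc_term = doc_topic_count[d][k] + alpha
        # Term 2: how much topic k likes word w
        word_term = (topic_word_count[k].get(w, 0) + gamma) / (
            topic_total[k] + V * gamma)
        weights.append(doc_term * word_term)   # multiply the two terms
    total = sum(weights)                       # renormalize across topics
    r, acc = random.random() * total, 0.0
    for k, wt in enumerate(weights):
        acc += wt
        if r < acc:
            return k
    return K - 1
```

In a full sampler this update is applied to every word in every document, cycling repeatedly through the corpus.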

Finally, in this module we talked about how we can use the output of Gibbs

sampling to do Bayesian inference.

Remember if we're thinking about doing predictions in the Bayesian framework, we

want to integrate over our uncertainty in what value the model parameters can take.

So we talked about how we can take each one of our Gibbs samples, form predictions

from that sample and then average across those samples.

Or alternatively, and something that's very commonly done in practice, is to just

look at the one sample that maximizes what we call the joint model probability, and

then use that to draw inferences.
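A minimal sketch of these two options, with invented sample values. Here each Gibbs sample is assumed to yield a predicted topic distribution for a new document plus that sample's joint model log-probability:

```python
# Hypothetical output of three Gibbs samples
samples = [
    {"prediction": [0.6, 0.4], "log_joint": -120.0},
    {"prediction": [0.5, 0.5], "log_joint": -118.0},
    {"prediction": [0.7, 0.3], "log_joint": -125.0},
]

# Option 1: average predictions across all samples,
# integrating over our uncertainty in the parameters
K = len(samples[0]["prediction"])
avg_prediction = [sum(s["prediction"][k] for s in samples) / len(samples)
                  for k in range(K)]

# Option 2: use only the single sample that maximizes
# the joint model probability
best_prediction = max(samples, key=lambda s: s["log_joint"])["prediction"]
```

Option 1 is the Bayesian model averaging described above; option 2 is the common practical shortcut.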