0:00

One subclass of Bayesian networks is the class called Naive Bayes, or sometimes, even more derogatorily, Idiot Bayes. As we'll see, Naive Bayes models are called that way because they make independence assumptions that are indeed very naive and overly simplistic. And yet they provide an interesting point on the tradeoff curve of model complexity that sometimes turns out to be surprisingly useful. So here is a Naive Bayes model.

This model is typically used for classification, that is, taking an instance where we have effectively observed a bunch of features. In most cases, although not necessarily, we're assuming that all of these features are observed for a given instance. And our goal is to infer to which class, among a limited set of classes, a particular instance belongs. So the features here are observed, and the class variable in general is hidden.

1:56

So if we look at the chain rule for Bayesian networks, as applied in this context, we see that we have a joint distribution, P(C, X1, ..., Xn), which can be written in a product form as a prior over the class variable C, times a product of the conditional probabilities of each feature given the class. To understand this model a little bit better, it helps to look at the ratio between the probabilities of two different classes, given a particular observation, that is, a particular assignment x1, ..., xn to the observed features.

So if we look at this ratio, we can see that it can be broken down as a product of two terms. The first is just the ratio of the prior probabilities of the two classes; that's the green term here. And the second is this blue term, the odds ratios, as they're called: that is, the probability of seeing a particular observation xi in the context of one class, relative to the context of the other class.
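In symbols, the factorization and the two-class ratio just described can be written as follows (using c1 and c2 for the two classes being compared):

```latex
P(C, X_1, \dots, X_n) \;=\; P(C) \prod_{i=1}^{n} P(X_i \mid C)

\frac{P(C = c^1 \mid x_1, \dots, x_n)}{P(C = c^2 \mid x_1, \dots, x_n)}
  \;=\;
  \underbrace{\frac{P(C = c^1)}{P(C = c^2)}}_{\text{ratio of priors}}
  \;\cdot\;
  \underbrace{\prod_{i=1}^{n} \frac{P(x_i \mid C = c^1)}{P(x_i \mid C = c^2)}}_{\text{odds ratios}}
```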

3:17

So let's look at an application of the Naive Bayes model in one of the places where it's actually very commonly used, which is the context of text classification. So imagine that we're trying to take a document and figure out to which category that document belongs. We have some set of categories in mind. For example, is this a document about pets, is it about finance, is it about vacations? We have some set of categories and we'd like to assign the document to one of those. It turns out that one can use two different Naive Bayes models for tackling this problem. The first of those is called the Bernoulli Naive Bayes model for text.

It treats every word in a dictionary, so you open your dictionary and there are several, maybe 10,000, words in that dictionary. And so you have a random variable for every one of those words, or at least the ones that occur in the kind of documents you're interested in, and for each word in the dictionary, we have a binary random variable, which is one if the word appears in the document.

4:33

And zero otherwise. If we have, say, five thousand words that we're interested in, we would have five thousand of these binary variables, and so the probability, the CPD, associated with each variable is in this case the probability that the word appears, say the probability that the word "cat" appears, given the label of the document.

So, for example, if we had two categories that we're looking at right now, say documents about finance versus documents about pets, you would expect that in a document about pets the word "cat" is quite likely to appear. I'm only showing the probability of "cat" appearing; the probability of "cat" not appearing would be 0.7. The probability that "dog" appears might be 0.4, and so on. Whereas for a document about finance, we're less likely to see the words "cat" and "dog" and more likely to see the words "buy" and "sell", so the probability that "buy" appears might be considerably higher in that case. So this is a Bernoulli Naive Bayes model: it's a Bernoulli model because each of these is a binary variable subject to a Bernoulli distribution, and it's Naive Bayes because it makes the very strong independence assumption that the event of one word appearing is independent of the event of a different word appearing, given that we know the class. Obviously that assumption is far too strong to hold in reality, but it turns out to be a reasonably good approximation in practice.
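As a concrete illustration, here is a minimal sketch of classification with a Bernoulli Naive Bayes model. The per-word probabilities echo the cat/dog/buy/sell example above, but all numbers (and the uniform prior) are made-up illustrative values, not learned parameters:

```python
# Minimal Bernoulli Naive Bayes sketch for two classes ("pets" vs. "finance").
# All probabilities are hypothetical illustrative numbers, not learned values.

# P(word appears in document | class)
p_appear = {
    "pets":    {"cat": 0.3,  "dog": 0.4,  "buy": 0.02, "sell": 0.02},
    "finance": {"cat": 0.01, "dog": 0.01, "buy": 0.2,  "sell": 0.25},
}
prior = {"pets": 0.5, "finance": 0.5}  # P(C), assumed uniform here

def score(cls, words_present):
    """P(C=cls) * product over all dictionary words of P(X_w | cls)."""
    p = prior[cls]
    for w, p_w in p_appear[cls].items():
        # Each dictionary word is a binary variable: multiply by
        # P(appears | cls) if present, else P(absent | cls).
        p *= p_w if w in words_present else (1.0 - p_w)
    return p

def classify(words_present):
    # Pick the class with the highest joint score P(C) * P(features | C).
    return max(prior, key=lambda c: score(c, words_present))

print(classify({"cat", "dog"}))   # prints "pets"
print(classify({"buy", "sell"}))  # prints "finance"
```

Note that every dictionary word contributes a factor, whether or not it appears: absence of "buy" is evidence too.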

A second model for the same problem is what's called the multinomial Naive Bayes model for text.

In this case, the variables that represent the features are not the words in the dictionary, but rather the words in the document. So here, N is the length of the document. If you have a document that has 737 words, you're going to have 737 random variables. And the value of each of these random variables is the actual word.

7:01

That is, the word in the first, second, up to the Nth position in the document. And so, if you have say that same dictionary of 5,000 possible words, this is no longer a binary random variable, but rather one that takes values from one up to the size of the dictionary, say 5,000.

7:25

Now this might seem like a very complicated model, because the CPD now needs to specify the probability distribution over words in the dictionary for every possible position in the document. So we would need a probability distribution over the word in position one, in position two, and so on up to position N. But we're going to address that by assuming that these probability distributions are all the same: the probability distribution over words in position one is the same as for words in position two, three, and so on. Now, this is a multinomial Naive Bayes model because the parameterization for each of these words is a multinomial distribution. So what we see here is not the same as what we saw on the previous slide, where we had a bunch of binary random variables. Here this is a multinomial, which means that all these entries sum to one. And it's a Naive Bayes model because it makes, again, a strong independence assumption, a different independence assumption: in this case, it makes the assumption that the word in position one is independent of the word in position two, given the class variable.

And once again, if you think about, say, two-word phrases that are common, this assumption is clearly overly strong, and yet it appears to be a good approximation in a variety of practical applications, most notably in this kind of document classification, where it's still quite commonly used.
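A corresponding sketch of the multinomial model looks like this. A single shared distribution P(word | class) is reused at every position, reflecting the parameter-sharing assumption above; the parameters are again hypothetical, and the sum is taken in log space so the product over many positions doesn't underflow on long documents:

```python
import math

# Minimal multinomial Naive Bayes sketch. One shared distribution
# P(word | class) is reused for every position in the document.
# All numbers are hypothetical; each class's distribution sums to one.
p_word = {
    "pets":    {"cat": 0.4,  "dog": 0.4,  "buy": 0.1,  "sell": 0.1},
    "finance": {"cat": 0.05, "dog": 0.05, "buy": 0.45, "sell": 0.45},
}
prior = {"pets": 0.5, "finance": 0.5}

def log_score(cls, document):
    """log P(C=cls) + sum over positions of log P(word at that position | cls)."""
    s = math.log(prior[cls])
    for word in document:  # one factor per position, same CPD each time
        s += math.log(p_word[cls][word])
    return s

def classify(document):
    return max(prior, key=lambda c: log_score(c, document))

# Unlike the Bernoulli model, a repeated word contributes one factor
# per occurrence, so word counts matter.
print(classify(["cat", "cat", "dog", "buy"]))  # prints "pets"
```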

So to summarize, Naive Bayes actually provides us with a very simple approach for classification problems. It's computationally very efficient, and the models are easy to construct, whether by hand or by machine learning techniques. It turns out to be a surprisingly effective method in domains that have a large number of weakly relevant features, such as the text domains that we've talked about. On the other hand, the strong independence assumptions that we talked about, the conditional independence of different features given the class, reduce the performance of these models, especially in cases where we have multiple highly correlated features.